Variational Autoencoder (8): Estimating Sample Probability Density

By 苏剑林 | December 09, 2021

In the previous articles of this series, we have understood VAE from multiple perspectives. Generally, VAE is used to obtain a generative model or a better encoding model, which are the standard applications of VAE. However, besides these common uses, there are some "niche requirements," such as estimating the probability density of $x$, which is often used in tasks like compression.

This article will explore and derive the VAE model from the perspective of estimating probability density.

Two Problems

The estimation of probability density involves taking known samples $x_1, x_2, \dots, x_N \sim \tilde{p}(x)$ and using a parametric family of probability densities $q_{\theta}(x)$ to fit these samples. The general goal is to minimize the negative log-likelihood:

\begin{equation}\mathbb{E}_{x\sim \tilde{p}(x)}[-\log q_{\theta}(x)] = -\frac{1}{N}\sum_{i=1}^N \log q_{\theta}(x_i)\label{eq:mle}\end{equation}
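
As a minimal concrete example, the sketch below evaluates this objective for a hypothetical one-dimensional Gaussian family $q_{\theta}(x)=\mathcal{N}(x;\mu,\sigma^2)$, chosen purely for illustration; any parametric family could be substituted:

```python
import numpy as np

# Average negative log-likelihood of samples under q_theta(x); here q_theta is
# a hypothetical 1-D Gaussian with theta = (mu, sigma), used only to make the
# formula concrete.
def avg_nll(x, mu, sigma):
    log_q = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
    return -np.mean(log_q)

x = 1.0 + 2.0 * np.random.randn(10000)   # toy samples from p~(x)
print(avg_nll(x, mu=1.0, sigma=2.0))     # near the minimum of the objective
print(avg_nll(x, mu=0.0, sigma=1.0))     # a worse fit gives a larger value
```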

But this is only a theoretical form; several problems remain to be solved, and they fall mainly into two questions:

1. What kind of $q_{\theta}(x)$ should be used for fitting?
2. What method should be used to solve the objective?

Mixture Models

Regarding the first problem, we naturally hope that $q_{\theta}(x)$ has a fitting capability that is as strong as possible; ideally, it should be able to fit any probability distribution. Unfortunately, while neural networks theoretically have universal approximation capability, that applies to fitting functions, not probability distributions. A probability distribution must satisfy $q_{\theta}(x) \geq 0$ and $\int q_{\theta}(x) dx = 1$, and the latter is often difficult to guarantee.

Since a direct approach is difficult, we consider an indirect perspective by constructing a mixture model:

\begin{equation}q_{\theta}(x) = \int q_{\theta}(x|z)q(z)dz=\mathbb{E}_{z\sim q(z)}[q_{\theta}(x|z)]\label{eq:q}\end{equation}

Here, $q(z)$ is typically chosen to be a simple, parameter-free distribution, such as the standard normal distribution; and $q_{\theta}(x|z)$ is a simple parametric distribution conditioned on $z$, such as a normal distribution whose mean and variance are functions of $z$.

From the perspective of generative models, the above model is interpreted as a two-step operation: first sampling $z$ from $q(z)$, and then passing it to $q_{\theta}(x|z)$ to generate $x$. However, the focus of this article is on estimating probability density. We choose such a $q_{\theta}(x|z)$ because it provides sufficient capability to fit complex distributions. The final $q_{\theta}(x)$ is expressed as the average of multiple simple distributions $q_{\theta}(x|z)$. Readers familiar with Gaussian Mixture Models (GMM) should know that such a model can have very strong fitting capabilities, theoretically even fitting any distribution, so the capability to fit distributions is guaranteed.

Importance Sampling

However, the integral in Equation $\eqref{eq:q}$ generally has no simple closed form. Or rather, only distributions that cannot be written down simply and explicitly have sufficiently strong fitting capability. Therefore, to evaluate it we have to estimate the expectation $\mathbb{E}_{z\sim q(z)}[q_{\theta}(x|z)]$ by sampling. However, in practical scenarios the dimensions of $z$ and $x$ are quite high, and high-dimensional spaces suffer from the "curse of dimensionality": even if we sample millions or tens of millions of points, it is difficult to cover the space adequately, making it hard to estimate $\mathbb{E}_{z\sim q(z)}[q_{\theta}(x|z)]$ accurately.
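
To make the difficulty concrete, here is a toy numpy sketch of the naive estimator, with a made-up linear map standing in for the neural network that parameterizes the mean of $q_{\theta}(x|z)$ (everything here is illustrative, not the actual model):

```python
import numpy as np

d_z, d_x = 2, 4
W, b = np.random.randn(d_x, d_z), np.random.randn(d_x)   # hypothetical "decoder"
sigma = 0.1                                               # small, fixed output std

def log_gauss(v, mu, std):
    # log density of a diagonal Gaussian, summed over the last axis
    return np.sum(-0.5 * np.log(2 * np.pi * std**2) - (v - mu)**2 / (2 * std**2), axis=-1)

def naive_q(x, n_samples=100000):
    z = np.random.randn(n_samples, d_z)                   # z ~ q(z) = N(0, I)
    mu = z @ W.T + b                                      # decoder mean of q_theta(x|z)
    return np.mean(np.exp(log_gauss(x, mu, sigma)))       # average of q_theta(x|z)

x = np.random.randn(d_x)
print(naive_q(x))   # almost every term is ~0, so the estimate is extremely noisy
```

Even with $10^5$ samples from $q(z)$, almost every term $q_{\theta}(x|z)$ is vanishingly small, which is exactly the inefficiency described above.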

To this end, we need to find a way to narrow down the sampling space. First, we usually control the variance of $q_{\theta}(x|z)$ to be relatively small. As a result, for a given $x$, the values of $z$ that make $q_{\theta}(x|z)$ significantly large will not be many; for most $z$, the calculated $q_{\theta}(x|z)$ will be very close to zero. Thus, we only need to find a way to sample $z$ values that make $q_{\theta}(x|z)$ relatively large to obtain a good estimation of $\mathbb{E}_{z\sim q(z)}[q_{\theta}(x|z)]$.

Specifically, we introduce a new distribution $p_{\theta}(z|x)$, assuming that the $z$ values making $q_{\theta}(x|z)$ large follow this distribution. Thus, we have:

\[q_{\theta}(x) = \int q_{\theta}(x|z)q(z)dz=\int q_{\theta}(x|z)\frac{q(z)}{p_{\theta}(z|x)}p_{\theta}(z|x)dz=\mathbb{E}_{z\sim p_{\theta}(z|x)}\left[q_{\theta}(x|z)\frac{q(z)}{p_{\theta}(z|x)}\right]\]

In this way, we transform "purposeless" sampling from $q(z)$ into a more targeted sampling from $p_{\theta}(z|x)$. Because the variance of $q_{\theta}(x|z)$ is controlled to be small, the variance of $p_{\theta}(z|x)$ will naturally not be large, and the sampling efficiency increases. Note that in the generative model view, $p_{\theta}(z|x)$ is seen as an approximation of the posterior distribution, but from the perspective of estimating probability density, it is purely an importance weighting function; there is no need for a special interpretation of its meaning.
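
As a concrete sketch of this importance-sampled estimator, the snippet below computes $\log q_{\theta}(x)$ using a hypothetical Gaussian proposal $p_{\theta}(z|x)=\mathcal{N}(z;\mu_{\text{enc}}(x),\tau^2 I)$; the linear "encoder" and "decoder" are placeholders, and with trained networks the proposal would concentrate on the high-weight region of $z$:

```python
import numpy as np

d_z, d_x = 2, 4
Wd, bd = np.random.randn(d_x, d_z), np.random.randn(d_x)   # hypothetical "decoder"
We, be = np.random.randn(d_z, d_x), np.random.randn(d_z)   # hypothetical "encoder" (proposal)
sigma, tau = 0.1, 0.2                                       # small output / proposal stds

def log_gauss(v, mu, std):
    # log density of a diagonal Gaussian, summed over the last axis
    return np.sum(-0.5 * np.log(2 * np.pi * std**2) - (v - mu)**2 / (2 * std**2), axis=-1)

def log_q_importance(x, n_samples=100):
    mu_z = x @ We.T + be                                    # proposal mean mu_enc(x)
    z = mu_z + tau * np.random.randn(n_samples, d_z)        # z ~ p_theta(z|x)
    mu_x = z @ Wd.T + bd                                    # decoder mean
    log_w = (log_gauss(x, mu_x, sigma)                      # log q_theta(x|z)
             + log_gauss(z, 0.0, 1.0)                       # + log q(z)
             - log_gauss(z, mu_z, tau))                     # - log p_theta(z|x)
    m = log_w.max()                                         # stable log-mean-exp
    return m + np.log(np.mean(np.exp(log_w - m)))

x = np.random.randn(d_x)
print(log_q_importance(x))   # estimate of log q_theta(x) from only 100 samples
```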

Training Objective

At this point, we have solved the first problem: what distribution to use and how to better calculate this distribution. The remaining question is how to train it.

In fact, once we have the concept of importance sampling, we don't need to consider things like ELBO. We can directly use the objective in Equation $\eqref{eq:mle}$. Substituting the expression for $q_{\theta}(x)$, we get:

\[\mathbb{E}_{x\sim \tilde{p}(x)}\left[-\log \mathbb{E}_{z\sim p_{\theta}(z|x)}\left[q_{\theta}(x|z)\frac{q(z)}{p_{\theta}(z|x)}\right]\right]\]

If, for the inner expectation $\mathbb{E}_{z\sim p_{\theta}(z|x)}$, we sample only a single $z$ via reparameterization, the training objective becomes:

\[\mathbb{E}_{x\sim \tilde{p}(x)}\left[-\log \left(q_{\theta}(x|z)\frac{q(z)}{p_{\theta}(z|x)}\right)\right],\quad z\sim p_{\theta}(z|x)\]
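
In expectation over $z\sim p_{\theta}(z|x)$, this single-sample objective splits into a reconstruction term plus a KL term (a standard decomposition, stated here only to make the connection explicit):

\[\mathbb{E}_{z\sim p_{\theta}(z|x)}\left[-\log \left(q_{\theta}(x|z)\frac{q(z)}{p_{\theta}(z|x)}\right)\right]=\mathbb{E}_{z\sim p_{\theta}(z|x)}\left[-\log q_{\theta}(x|z)\right]+KL\big(p_{\theta}(z|x)\,\big\Vert\, q(z)\big)\]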

This is already the training objective of a conventional VAE. If we sample $M > 1$ points, then it is:

\[\mathbb{E}_{x\sim \tilde{p}(x)}\left[-\log \left(\frac{1}{M}\sum_{i=1}^M q_{\theta}(x|z_i)\frac{q(z_i)}{p_{\theta}(z_i|x)}\right)\right],\quad z_1,z_2,\dots,z_M\sim p_{\theta}(z|x)\]

This is the "Importance Weighted Autoencoder," originating from "Importance Weighted Autoencoders", which is considered an enhancement of VAE. In summary, through the lens of importance sampling, we can bypass the tedious derivations of traditional VAE such as ELBO, and we also do not need the joint distribution perspective introduced in "Variational Autoencoder (2): From a Bayesian Perspective"; we can directly obtain the VAE model and even its improved versions.

Summary

This article introduced the Variational Autoencoder (VAE) from the starting point of estimating sample probability density. By combining importance sampling, we can obtain a rapid derivation of VAE, completely avoiding many tedious details like ELBO.