By 苏剑林 | October 10, 2018
This article is very brief. It describes a fact that is simple but useful, and which, to my surprise, I only noticed after all this time:
In the article "Mutual Information in Deep Learning: Unsupervised Feature Extraction", we obtained the final loss of the Deep INFOMAX model by taking a weighted combination of a prior-distribution loss and a mutual-information maximization loss. Although that article told the full story, the loss was, in a sense, pieced together by hand. This article will show that the same loss can be derived naturally from the Variational Autoencoder (VAE).
Process
To repeat it once more, the loss that a Variational Autoencoder (VAE) needs to optimize is
$$KL(\tilde{p}(x)p(z|x)\Vert q(z)q(x|z))$$
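For concreteness, here is a minimal sketch (my own, not from the original post) of how this objective is usually optimized in practice, assuming a Gaussian encoder $p(z|x)$, a standard normal prior $q(z)$, and a Bernoulli decoder $q(x|z)$; up to the constant entropy $-H(\tilde{p}(x))$, the joint KL above reduces to the familiar reconstruction-plus-prior-KL form. The names `encoder` and `decoder` below are placeholder modules.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder):
    """Sketch of KL(p~(x)p(z|x) || q(z)q(x|z)) up to the constant -H(p~(x)).
    Assumes a Gaussian encoder, N(0, I) prior, Bernoulli decoder (all assumptions)."""
    mu, log_var = encoder(x)                                  # p(z|x) = N(mu, diag(exp(log_var)))
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization trick
    x_recon = decoder(z)                                      # Bernoulli means of q(x|z)
    # -E[log q(x|z)]: reconstruction part of the joint KL
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum") / x.size(0)
    # E_x[ KL(p(z|x) || q(z)) ] with q(z) = N(0, I), in closed form
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / x.size(0)
    return recon + kl
```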
Relevant discussions have appeared on this blog many times. A VAE contains both an encoder and a decoder; if all we want is to extract features, training a decoder seems redundant, so the question becomes how to remove the decoder.
Actually, it's quite simple. Split the VAE loss into two parts:
\begin{equation}
\begin{aligned}
&KL(\tilde{p}(x)p(z|x)\Vert q(z)q(x|z))\\
=&\iint \tilde{p}(x)p(z|x)\log \frac{p(z|x)}{q(z)} dzdx-\iint \tilde{p}(x)p(z|x)\log \frac{q(x|z)}{\tilde{p}(x)} dzdx
\end{aligned}
\end{equation}
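Spelling out the intermediate step that is implicit here: the split is nothing more than expanding the logarithm of the density ratio,
\begin{equation}
\begin{aligned}
&KL(\tilde{p}(x)p(z|x)\Vert q(z)q(x|z))\\
=&\iint \tilde{p}(x)p(z|x)\log \frac{\tilde{p}(x)p(z|x)}{q(z)q(x|z)} dzdx\\
=&\iint \tilde{p}(x)p(z|x)\left[\log \frac{p(z|x)}{q(z)} - \log \frac{q(x|z)}{\tilde{p}(x)}\right] dzdx
\end{aligned}
\end{equation}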
The first term is the KL divergence that matches the posterior to the prior distribution (the prior loss). The term $\log \frac{q(x|z)}{\tilde{p}(x)}$ in the second term is actually the pointwise mutual information of $x$ and $z$. If $q(x|z)$ has unlimited fitting capacity, then at the optimum it must hold that $\tilde{p}(x)p(z|x) = q(x|z)p(z)$, where $p(z)=\int \tilde{p}(x)p(z|x)dx$ (this is just Bayes' formula), so the second term is
\begin{equation}KL(q(x|z)p(z)\Vert \tilde{p}(x)p(z))=KL(\tilde{p}(x)p(z|x)\Vert \tilde{p}(x)p(z))\end{equation}
which is exactly the mutual information between the two random variables $x$ and $z$. The minus sign in front of this term means that minimizing the VAE loss amounts to maximizing this mutual information.
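To make the substitution explicit (writing $I(x;z)$ for the mutual information), replacing $q(x|z)$ with $\tilde{p}(x)p(z|x)/p(z)$ in the second term gives
\begin{equation}
\iint \tilde{p}(x)p(z|x)\log \frac{q(x|z)}{\tilde{p}(x)} dzdx = \iint \tilde{p}(x)p(z|x)\log \frac{p(z|x)}{p(z)} dzdx = I(x;z)
\end{equation}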
The remaining process is the same as in "Mutual Information in Deep Learning: Unsupervised Feature Extraction", omitted.
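As a rough illustration of where this leads (a sketch of my own under extra assumptions, not the exact construction of the companion article), the decoder-free objective reads "prior term minus a mutual-information estimate", with the mutual information estimated by a discriminator that separates matched pairs $(x, z(x))$ from mismatched ones, in the spirit of Deep INFOMAX; `encoder`, `discriminator`, and `beta` below are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def encoder_only_loss(x, encoder, discriminator, beta=1.0):
    """Hypothetical decoder-free objective: prior KL minus a JSD-style
    discriminator bound on the mutual information (illustrative sketch only)."""
    mu, log_var = encoder(x)                                  # p(z|x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    # Prior term: KL(p(z|x) || N(0, I)), same closed form as in the VAE
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / x.size(0)
    # MI term: score matched pairs (joint) against shuffled pairs (product of marginals)
    z_shuffled = z[torch.randperm(z.size(0), device=z.device)]
    t_pos = discriminator(x, z)            # T(x, z) on samples of the joint
    t_neg = discriminator(x, z_shuffled)   # T(x, z) on samples of the marginals
    mi_bound = (-F.softplus(-t_pos)).mean() - F.softplus(t_neg).mean()
    return beta * kl - mi_bound            # minimize: match the prior, maximize MI
```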
Conclusion
As stated at the beginning, this article is very brief and does not contain much content. Its main purpose is to offer a new reading of the Variational Autoencoder's loss (prior distribution matching + mutual information maximization), from which the loss of Deep INFOMAX can be derived naturally.
If I had not yet written "Mutual Information in Deep Learning: Unsupervised Feature Extraction", I would certainly have used this as the starting point to explain Deep INFOMAX. But since that article was published several days ago, I am adding this short post as a supplementary explanation.