By 苏剑林 | October 10, 2018
This article is very brief. It describes a fact that is simple but useful, and which, to my surprise, I only noticed after all this time:
In the article "Mutual Information in Deep Learning: Unsupervised Feature Extraction", we obtained the final loss of the Deep INFOMAX model by taking a weighted combination of a prior-distribution loss and a mutual-information maximization loss. Although that article told the full story, the loss was, in a sense, pieced together by hand. This article will show that the same loss can be derived naturally from the Variational Autoencoder (VAE).
Process
To repeat it once more, the loss that a Variational Autoencoder (VAE) needs to optimize is
$$KL(\tilde{p}(x)p(z|x)\Vert q(z)q(x|z))$$
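For concreteness, here is a minimal sketch (my own, not from the original post) of how this objective is usually optimized in practice, assuming a Gaussian encoder $p(z|x)$, a standard normal prior $q(z)$, and a Bernoulli decoder $q(x|z)$; up to the constant entropy $-H(\tilde{p}(x))$, the joint KL above reduces to the familiar reconstruction-plus-prior-KL form. The names `encoder` and `decoder` below are placeholder modules.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder):
    """Sketch of KL(p~(x)p(z|x) || q(z)q(x|z)) up to the constant -H(p~(x)).
    Assumes a Gaussian encoder, N(0, I) prior, Bernoulli decoder (all assumptions)."""
    mu, log_var = encoder(x)                                  # p(z|x) = N(mu, diag(exp(log_var)))
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization trick
    x_recon = decoder(z)                                      # Bernoulli means of q(x|z)
    # -E[log q(x|z)]: reconstruction part of the joint KL
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum") / x.size(0)
    # E_x[ KL(p(z|x) || q(z)) ] with q(z) = N(0, I), in closed form
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / x.size(0)
    return recon + kl
```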
Relevant discussions have appeared on this blog many times. A VAE contains both an encoder and a decoder; if all we want is to extract features, training a decoder seems redundant, so the question becomes how to remove the decoder.
Actually, it's quite simple. Split the VAE loss into two parts:
\begin{equation}
\begin{aligned}
&KL(\tilde{p}(x)p(z|x)\Vert q(z)q(x|z))\\
=&\iint \tilde{p}(x)p(z|x)\log \frac{p(z|x)}{q(z)} dzdx-\iint \tilde{p}(x)p(z|x)\log \frac{q(x|z)}{\tilde{p}(x)} dzdx
\end{aligned}
\end{equation}
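Spelling out the intermediate step that is implicit here: the split is nothing more than expanding the logarithm of the density ratio,
\begin{equation}
\begin{aligned}
&KL(\tilde{p}(x)p(z|x)\Vert q(z)q(x|z))\\
=&\iint \tilde{p}(x)p(z|x)\log \frac{\tilde{p}(x)p(z|x)}{q(z)q(x|z)} dzdx\\
=&\iint \tilde{p}(x)p(z|x)\left[\log \frac{p(z|x)}{q(z)} - \log \frac{q(x|z)}{\tilde{p}(x)}\right] dzdx
\end{aligned}
\end{equation}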
The first term is the KL divergence that matches the posterior to the prior distribution (the prior loss). The term $\log \frac{q(x|z)}{\tilde{p}(x)}$ in the second term is actually the pointwise mutual information of $x$ and $z$. If $q(x|z)$ has unlimited fitting capacity, then at the optimum it must hold that $\tilde{p}(x)p(z|x) = q(x|z)p(z)$, where $p(z)=\int \tilde{p}(x)p(z|x)dx$ (this is just Bayes' formula), so the second term is
\begin{equation}KL(q(x|z)p(z)\Vert \tilde{p}(x)p(z))=KL(\tilde{p}(x)p(z|x)\Vert \tilde{p}(x)p(z))\end{equation}
which is exactly the mutual information between the two random variables $x$ and $z$. The minus sign in front of this term means that minimizing the VAE loss amounts to maximizing this mutual information.
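To make the substitution explicit (writing $I(x;z)$ for the mutual information), replacing $q(x|z)$ with $\tilde{p}(x)p(z|x)/p(z)$ in the second term gives
\begin{equation}
\iint \tilde{p}(x)p(z|x)\log \frac{q(x|z)}{\tilde{p}(x)} dzdx = \iint \tilde{p}(x)p(z|x)\log \frac{p(z|x)}{p(z)} dzdx = I(x;z)
\end{equation}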
The remaining process is the same as in "Mutual Information in Deep Learning: Unsupervised Feature Extraction", omitted.
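As a rough illustration of where this leads (a sketch of my own under extra assumptions, not the exact construction of the companion article), the decoder-free objective reads "prior term minus a mutual-information estimate", with the mutual information estimated by a discriminator that separates matched pairs $(x, z(x))$ from mismatched ones, in the spirit of Deep INFOMAX; `encoder`, `discriminator`, and `beta` below are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def encoder_only_loss(x, encoder, discriminator, beta=1.0):
    """Hypothetical decoder-free objective: prior KL minus a JSD-style
    discriminator bound on the mutual information (illustrative sketch only)."""
    mu, log_var = encoder(x)                                  # p(z|x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    # Prior term: KL(p(z|x) || N(0, I)), same closed form as in the VAE
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / x.size(0)
    # MI term: score matched pairs (joint) against shuffled pairs (product of marginals)
    z_shuffled = z[torch.randperm(z.size(0), device=z.device)]
    t_pos = discriminator(x, z)            # T(x, z) on samples of the joint
    t_neg = discriminator(x, z_shuffled)   # T(x, z) on samples of the marginals
    mi_bound = (-F.softplus(-t_pos)).mean() - F.softplus(t_neg).mean()
    return beta * kl - mi_bound            # minimize: match the prior, maximize MI
```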
Conclusion
As stated at the beginning, this article is very brief and does not contain much content. Its main purpose is to offer a new reading of the Variational Autoencoder's loss (prior distribution matching + mutual information maximization), from which the loss of Deep INFOMAX can be derived naturally.
If I had not yet written "Mutual Information in Deep Learning: Unsupervised Feature Extraction", I would certainly have used this as the starting point to explain Deep INFOMAX. But since that article was published several days ago, I am adding this short post as a supplementary explanation.