From Variational Encoding and Information Bottleneck to Normal Distribution: On the Importance of Forgetting

By 苏剑林 | November 27, 2018

This is an "essay" where we will discuss three things that are inextricably linked: Variational Autoencoders (VAE), Information Bottleneck (IB), and the Normal Distribution.

As is well known, the Variational Autoencoder is a classic generative model, but it actually carries implications that go beyond generative modeling. Regarding Information Bottleneck, people might be relatively less familiar with it, although it did attract quite a bit of attention last year. As for the Normal Distribution, it goes without saying—it is connected to almost every field of machine learning to some extent.

So, what story is there to tell when these three collide? And what do they have to do with "forgetting"?

Variational Autoencoder

You can find several articles introducing VAE on this blog. Below is a brief review.

Review of Theoretical Form

Simply put, the optimization objective of VAE is:

\begin{equation}KL(\tilde{p}(x)p(z|x)\Vert q(z)q(x|z))=\iint \tilde{p}(x)p(z|x)\log \frac{\tilde{p}(x)p(z|x)}{q(x|z)q(z)} dzdx\end{equation}

where $q(z)$ is the standard normal distribution, and $p(z|x), q(x|z)$ are conditional normal distributions corresponding to the encoder and decoder respectively. For specific details, please refer to "Variational Autoencoder (2): From a Bayesian Perspective".

This objective can ultimately be simplified to:

\begin{equation}\mathbb{E}_{x\sim \tilde{p}(x)} \Big[\mathbb{E}_{z\sim p(z|x)}\big[-\log q(x|z)\big]+KL\big(p(z|x)\big\Vert q(z)\big)\Big]\label{eq:vae}\end{equation}

Clearly, it can be viewed in two parts: the term $\mathbb{E}_{z\sim p(z|x)}\big[-\log q(x|z)\big]$ is equivalent to the ordinary autoencoder reconstruction loss (with the reparameterization trick added), and $KL\big(p(z|x)\big\Vert q(z)\big)$ is the KL divergence between the posterior and the prior. The first term wants the reconstruction loss to be as small as possible, i.e., it wants the latent variable $z$ to retain as much information as possible; the second term pulls the latent distribution toward the standard normal prior, i.e., it wants the distribution of the latent variables to be as regular as possible.
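For the usual diagonal Gaussian encoder $p(z|x)=\mathcal{N}\big(z;\mu(x),\text{diag}(\sigma^2(x))\big)$ and standard normal prior $q(z)=\mathcal{N}(0,I)$, this KL term has the well-known closed form

$$KL\big(p(z|x)\big\Vert q(z)\big)=\frac{1}{2}\sum_{i}\Big(\mu_i^2(x)+\sigma_i^2(x)-\log \sigma_i^2(x)-1\Big)$$

which is what implementations actually compute, including the one later in this article.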

Comparison with Autoencoders

Therefore, compared to ordinary autoencoders, the changes in VAE are the following (both are sketched in code right after the list):

1. It introduces the concepts of mean and variance and adds the reparameterization operation;
2. It adds KL divergence as an additional loss function.
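
In a few lines, the two changes look like the minimal numpy sketch below (illustrative only; it assumes an encoder that outputs z_mean and z_log_var for each input):

import numpy as np

def reparameterize(z_mean, z_log_var):
    """Change 1: sample z = mu + sigma * eps with eps ~ N(0, I), so that
    sampling becomes a deterministic, differentiable function of the
    encoder outputs."""
    eps = np.random.randn(*z_mean.shape)
    return z_mean + np.exp(z_log_var / 2) * eps

def kl_to_standard_normal(z_mean, z_log_var):
    """Change 2: the closed-form KL(p(z|x) || N(0, I)) given above."""
    return 0.5 * np.sum(np.square(z_mean) + np.exp(z_log_var) - z_log_var - 1)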

Information Bottleneck

I believe enough information about VAE has been introduced on this site, so I will not dwell on it and will immediately move to the introduction of Information Bottleneck (IB).

Unveiling the Black Box of DL?

In September last year, there was a lecture on deep learning and the information bottleneck that claimed to unveil the black box of deep learning (DL). After watching it, the great Hinton commented, "This is so interesting, I need to watch it 10,000 more times..." (see "Opening the Black Box of Deep Learning: Hebrew University Professor Proposes 'Information Bottleneck'"). Subsequently, the Information Bottleneck became popular. Shortly afterwards, an article challenging this result appeared, showing that the conclusions of the Information Bottleneck are not universal (see "Puncturing the Bubble: A Critical Analysis of the 'Information Bottleneck' Theory"), which stirred up even more debate.

Regardless of whether information bottleneck can reveal the secrets of deep learning, as practitioners, we primarily look at whether information bottleneck can actually extract something valuable for use. The so-called Information Bottleneck is a relatively simple idea: it says that when facing a task, we should try to complete it using the minimum amount of information. This is actually similar to our previous discussion on the "Minimum Entropy Series", because information corresponds to learning costs. Completing a task with the least information means completing the task with the lowest cost, which implies obtaining a model with better generalization performance.

Principles of Information Bottleneck

Why does lower cost/less information lead to better generalization? This is not hard to understand. For instance, in a company, if we were to customize a solution for every single client and assign a dedicated person to follow up, the cost would be enormous. If we can find a universal solution and only perform minor adjustments based on it later, the cost would be much lower. A "universal solution" exists because we have found the commonalities and patterns of client needs. Thus, it is obvious that a minimum-cost solution means we have found some universal laws and characteristics, which translates to generalization capability.

How do we reflect this in deep learning? The answer is "Variational Information Bottleneck" (VIB), originating from the paper "Deep Variational Information Bottleneck".

Suppose we are facing a classification task with labeled data pairs $(x_1, y_1), \dots, (x_N, y_N)$. We can understand this task in two steps: the first step is encoding, and the second step is classification. The first step encodes $x$ into a latent variable $z$, and then the classifier recognizes $z$ as the category $y$.

$$x \quad \to \quad z \quad \to \quad y$$

Then imagine adding a "bottleneck" $\beta$ at the position of $z$. It is like an hourglass; the amount of incoming information may be large, but the exit is only as large as $\beta$. So the function of this bottleneck is: it does not allow the amount of information flowing through $z$ to exceed $\beta$. Unlike an hourglass where sand just passes through, after information passes through the bottleneck, it still needs to complete its task (classification, regression, etc.). Therefore, the model is forced to find a way to let only the most important information pass through the bottleneck. This is the principle of information bottleneck.

Variational Information Bottleneck

How is this carried out quantitatively? We use "mutual information" as the metric for the amount of information passing through:

\begin{equation}\iint p(z|x)\tilde{p}(x)\log \frac{p(z|x)}{p(z)}dxdz\end{equation}

Here $p(z)$ is not an arbitrarily specified distribution but the true distribution of the latent variables. Theoretically, after knowing $p(z|x)$, we can calculate $p(z)$ because it formally equals:

\begin{equation}p(z) = \int p(z|x)\tilde{p}(x)dx\end{equation}

Of course, this integral is often difficult to calculate, so we will look for another way later.
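
To make the difficulty concrete, a direct Monte Carlo estimate would look something like the sketch below (one-dimensional latent for simplicity; encode is a hypothetical encoder returning the Gaussian parameters of $p(z|x)$ for each sample):

import numpy as np
from scipy.stats import norm

def estimate_p_z(z, x_samples, encode):
    # p(z) is approximately (1/N) * sum_i N(z; mu_i, sigma_i^2): a mixture
    # with one component per data point, so every evaluation needs a pass
    # over the whole dataset.
    mus, sigmas = encode(x_samples)
    return np.mean(norm.pdf(z, loc=mus, scale=sigmas))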

Then, we also have the task loss, such as cross-entropy for a classification task:

\begin{equation}-\iint p(z|x)\tilde{p}(x)\log p(y|z)dxdz\end{equation}

Written in this form, it indicates that we have an encoder that first encodes $x$ into $z$, and then classifies $z$.

How do we enforce that "the amount of information flowing through $z$ does not exceed $\beta$"? We can add this constraint directly as a penalty term, making the final loss:

\begin{equation}-\iint p(z|x)\tilde{p}(x)\log p(y|z)dxdz + \lambda \max\left(\iint p(z|x)\tilde{p}(x)\log \frac{p(z|x)}{p(z)}dxdz - \beta,\, 0\right)\end{equation}

That is to say, after the mutual information exceeds $\beta$, a positive penalty term appears. Of course, in many cases, we don't know what $\beta$ should be set to, so a more straightforward approach is to remove $\beta$, obtaining:

\begin{equation}-\iint p(z|x)\tilde{p}(x)\log p(y|z)dxdz + \lambda \iint p(z|x)\tilde{p}(x)\log \frac{p(z|x)}{p(z)}dxdz\end{equation}

This simply desires the amount of information to be as small as possible without setting a specific threshold.
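
Viewed as scalar functions of the mutual information, the two penalty variants are simply the following (a schematic sketch; mutual_info stands in for the integral above, which is itself the hard part to compute):

def ib_penalty(mutual_info, lamb, beta=None):
    # With a threshold beta, only the information in excess of beta is
    # penalized; without one, all information through z is penalized.
    if beta is not None:
        return lamb * max(mutual_info - beta, 0.0)
    return lamb * mutual_info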

Now we have the formula, but as mentioned, $p(z)$ cannot be computed directly, so we have to estimate an upper bound: assuming $q(z)$ is some distribution of known form, we have:

\begin{equation}\begin{aligned}&\iint p(z|x)\tilde{p}(x)\log \frac{p(z|x)}{p(z)}dxdz\\ =&\iint p(z|x)\tilde{p}(x)\log \frac{p(z|x)}{q(z)}\frac{q(z)}{p(z)}dxdz\\ =&\iint p(z|x)\tilde{p}(x)\log \frac{p(z|x)}{q(z)}dxdz + \iint p(z|x)\tilde{p}(x)\log \frac{q(z)}{p(z)}dxdz\\ =&\iint p(z|x)\tilde{p}(x)\log \frac{p(z|x)}{q(z)}dxdz + \int p(z)\log \frac{q(z)}{p(z)}dz\\ =&\iint p(z|x)\tilde{p}(x)\log \frac{p(z|x)}{q(z)}dxdz - \int p(z)\log \frac{p(z)}{q(z)}dz\\ =&\int \tilde{p}(x) KL\big(p(z|x)\big\Vert q(z)\big) dx - KL\big(p(z)\big\Vert q(z)\big)\\ \leq &\int \tilde{p}(x) KL\big(p(z|x)\big\Vert q(z)\big) dx\end{aligned}\end{equation}

This shows that $\int\tilde{p}(x) KL\big(p(z|x)\big\Vert q(z)\big) dx$ is an upper bound of the mutual information $\iint p(z|x)\tilde{p}(x)\log \frac{p(z|x)}{p(z)}dxdz$; if we minimize the former, the latter can never exceed it. Since the mutual information cannot be computed directly, we optimize this upper bound instead. Thus, the final usable loss is:

\begin{equation}-\iint p(z|x)\tilde{p}(x)\log p(y|z)dxdz + \lambda \int\tilde{p}(x) KL\big(p(z|x)\big\Vert q(z)\big) dx\end{equation}

Or written equivalently as:

\begin{equation}\mathbb{E}_{x\sim \tilde{p}(x)} \Big[\mathbb{E}_{z\sim p(z|x)}\big[-\log p(y|z)\big]+\lambda\cdot KL\big(p(z|x)\big\Vert q(z)\big)\Big]\label{eq:vib}\end{equation}

This is "Variational Information Bottleneck".

Observations and Implementation

It can be seen that if $q(z)$ is taken to be the standard normal distribution (in fact, we always take it this way, so this "if" is automatically satisfied), then equation $\eqref{eq:vib}$ is almost identical to the VAE loss function $\eqref{eq:vae}$. The difference is that $\eqref{eq:vib}$ deals with supervised tasks, while $\eqref{eq:vae}$ is unsupervised learning. However, if we treat VAE as a supervised learning task whose label is $x$ itself, then it becomes a special case of $\eqref{eq:vib}$.

So, compared to the original supervised learning task, the changes of Variational Information Bottleneck are:

1. It introduces the concepts of mean and variance and adds the reparameterization operation;
2. It adds KL divergence as an additional loss function.

Exactly like VAE!

Implementing Variational Information Bottleneck in Keras is very simple. I have defined a layer for everyone to call:

from keras.layers import Layer
import keras.backend as K

class VIB(Layer):
    """Variational Information Bottleneck layer.

    Takes [z_mean, z_log_var], adds the KL penalty to the model loss,
    and returns a reparameterized sample of z (just z_mean at test time).
    """
    def __init__(self, lamb, **kwargs):
        self.lamb = lamb  # weight of the KL penalty (the lambda in the VIB loss)
        super(VIB, self).__init__(**kwargs)
    def call(self, inputs):
        z_mean, z_log_var = inputs
        u = K.random_normal(shape=K.shape(z_mean))
        # Closed-form KL(p(z|x) || N(0, I)), averaged over the batch
        kl_loss = - 0.5 * K.sum(K.mean(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), 0))
        self.add_loss(self.lamb * kl_loss)
        u = K.in_train_phase(u, 0.)  # sampling noise only during training
        return z_mean + K.exp(z_log_var / 2) * u  # reparameterization
    def compute_output_shape(self, input_shape):
        return input_shape[0]

The usage is simple; just make a slight modification to your original task. Please refer to: https://github.com/bojone/vib/blob/master/cnn_imdb_vib.py
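
For instance, a hypothetical minimal setup might look like this (the input size, layer widths, and 10-class softmax head are made up for illustration; the linked script contains the real IMDB example):

from keras.layers import Input, Dense
from keras.models import Model

x_in = Input(shape=(784,))
h = Dense(256, activation='relu')(x_in)
z_mean = Dense(64)(h)                    # encoder outputs latent statistics
z_log_var = Dense(64)(h)
z = VIB(lamb=0.1)([z_mean, z_log_var])   # reparameterize + add KL penalty
y_out = Dense(10, activation='softmax')(z)

model = Model(x_in, y_out)
model.compile(loss='categorical_crossentropy', optimizer='adam')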

Effect: compared to the same model without VIB, the model with VIB converges faster and easily reaches 89%+ validation accuracy, whereas the model without VIB usually only reaches 88%+ and converges more slowly.

Variational Discriminator Bottleneck

The original paper "Deep Variational Information Bottleneck" showed that VIB is a quite effective regularization method, improving original model performance on multiple tasks.

However, the story of Information Bottleneck does not end there. Not long ago, a paper titled "Variational Discriminator Bottleneck" was chosen as a high-score paper for ICLR 2019 (appearing around the same time as the famous BigGAN). The authors were no longer satisfied with using VIB only for ordinary supervised tasks; the paper developed "Variational Discriminator Bottleneck" and applied it to various tasks such as GANs and Reinforcement Learning, achieving improvements in all of them! The power of information bottleneck is evident.

Unlike equation $\eqref{eq:vib}$, the update mechanism of information bottleneck in "Variational Discriminator Bottleneck" was modified to give it a certain adaptive ability, but the fundamental idea remained unchanged: using restricted mutual information to regularize the model. However, this is no longer the focus of this article. Interested readers please refer to the original paper.

Normal Distribution

Through comparison, we have found that VAE and VIB both just introduce reparameterization into the original task and add the KL divergence term. Intuitively, the role of the regularization term is to hope that the distribution of latent variables is closer to the standard normal distribution. So, what exactly are the benefits of the normal distribution?

Regularity and Decoupling

Actually, to talk about the origins, stories, and uses of the normal distribution, one could write an entire book. Many properties of the normal distribution have appeared elsewhere; here I will only introduce the parts most relevant to this article.

In fact, the role of KL divergence is to align the distribution of the latent variables with a (multivariate) standard normal distribution, rather than any arbitrary normal distribution. The standard normal distribution is relatively regular, having benefits like zero mean and unit variance. More importantly, the standard normal distribution has a very valuable characteristic: its components are decoupled. In probabilistic terms, they are mutually independent, satisfying $p(x, y) = p(x)p(y)$.

We know that if features are mutually independent, modeling becomes much easier (the Naive Bayes classifier is a perfectly exact model in such a case), and mutually independent features have much better interpretability. Therefore, we always hope for features to be independent. As early as 1992, Schmidhuber, the father of LSTM, proposed the Predictability Minimization (PM) model, dedicated to constructing an autoencoder with decoupled features. Related stories can be found in "From PM to GAN - LSTM Father Schmidhuber's 22-Year Grudge". Yes, in the years before I even arrived on Earth, the masters were already working on feature decoupling, showing how valuable it is.

In VAE (including later Adversarial Autoencoders), the latent distribution is directly aligned with a decoupled prior distribution via KL divergence. The benefit of this is that the latent variables themselves become close to being decoupled, thus possessing the various advantages of decoupling mentioned earlier. Now, we can answer a question that might likely be asked:

Question: From the perspective of feature encoding, what are the advantages of Variational Autoencoders compared to ordinary autoencoders?
Answer: VAE makes the latent variable distribution close to the standard normal distribution through KL divergence, thereby decoupling latent features and simplifying downstream models built upon these features. (Of course, you can also answer by mentioning the Variational Information Bottleneck discussed earlier, such as enhancing generalization performance, etc. ^_^)

Linear Interpolation and Convolution

Furthermore, the normal distribution has an important property, which is often used to demonstrate the effects of generative models: linear interpolation.

(Referencing the interpolation effect image from the Glow model)

The process of this linear interpolation is: first sample two random vectors $z_1, z_2 \sim \mathcal{N}(0, 1)$. Clearly, a good generator can generate realistic images $g(z_1), g(z_2)$ from $z_1, z_2$. Then we consider $g(z_{\alpha})$, where:

\begin{equation}z_{\alpha} = (1 - \alpha) z_1 + \alpha z_2,\quad 0 \leq \alpha \leq 1\end{equation}

Considering the transition of $\alpha$ from 0 to 1, we expect to see $g(z_{\alpha})$ as image $g(z_1)$ gradually transitioning into image $g(z_2)$. In fact, this is indeed the case.
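
In code, the whole procedure is just the following sketch (g stands for a hypothetical trained generator that maps latent vectors to images):

import numpy as np

z1 = np.random.randn(128)   # two random latent vectors
z2 = np.random.randn(128)
for alpha in np.linspace(0, 1, 8):
    z_alpha = (1 - alpha) * z1 + alpha * z2
    # image = g(z_alpha)    # the frames morph from g(z1) into g(z2)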

Why must interpolation be done in the latent space? Why doesn't directly interpolating original images yield valuable results? This is actually related to the normal distribution because we have the following convolution theorem (this convolution is the mathematical convolution operator, not the convolutional layer of a neural network):

If $z_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $z_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$ are independent random variables, then:
$$z_1 + z_2 \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$$
In particular, if $z_1 \sim \mathcal{N}(0, 1)$ and $z_2 \sim \mathcal{N}(0, 1)$, then:
$$\alpha z_1 + \beta z_2 \sim \mathcal{N}(0, \alpha^2 + \beta^2)$$

This means that the sum of independent normal random variables is still normal. What does this imply? It implies that in the world of the normal distribution, a linear interpolation of two points still lies within that world. This is no ordinary property, because interpolating two real samples directly does not, in general, produce a realistic image.

For supervised tasks, what is the value of this linear interpolation property? A great deal. We know that labeled data is hard to come by. If we can reasonably map the latent space of a finite training set to a standard normal distribution, then we can hope that samples not covered by the training set are also accounted for, because their latent variables may simply be linear interpolations of the latent variables of the training samples.

In other words, after we complete supervised training and regularize the latent variable distribution into a standard normal distribution, we have effectively taken the transition samples between training samples into account, thereby covering more samples.

Note: we usually consider uniform interpolation along a straight line, i.e., taking $\beta = 1 - \alpha$. But from the perspective of $\alpha z_1 + \beta z_2 \sim \mathcal{N}(0, \alpha^2 + \beta^2)$, the better choice is an interpolation that keeps $\alpha^2 + \beta^2 = 1$, namely:
$$z_{\theta} = z_1 \cdot \cos\theta + z_2 \cdot \sin\theta$$
Secondly, readers might wonder: when a GAN's prior is a uniform distribution, linear interpolation also seems to work, so is this property really unique to the normal distribution? In fact, the convolution of uniform distributions is no longer uniform, but its probability density happens to concentrate in the middle of the original uniform distribution's support (it is simply no longer uniform, roughly a subset of the original), so the interpolation effect is still acceptable; theoretically, however, it is less elegant. Moreover, in practice the normal distribution is used in most current GAN training, and it works better than the uniform distribution.
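
This is easy to verify numerically; the sketch below checks the variance of the interpolant at the midpoint of both schemes:

import numpy as np

z1 = np.random.randn(100000)
z2 = np.random.randn(100000)

# Linear midpoint: variance = 0.5^2 + 0.5^2 = 0.5, so the interpolant
# "shrinks" toward the origin and leaves the standard normal world.
print(np.var(0.5 * z1 + 0.5 * z2))                      # ~0.5

# Spherical midpoint: cos^2 + sin^2 = 1, so it stays standard normal.
theta = np.pi / 4
print(np.var(np.cos(theta) * z1 + np.sin(theta) * z2))  # ~1.0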

Learning and Forgetting

Finally, after saying so much, all the content actually has a very intuitive correspondence: Forgetting.

Forgetting is also a very important topic in deep learning, with related research results appearing from time to time. For example, if we fine-tune a trained model on a dataset from a new domain, the model often ends up working only for the new domain rather than for both; this is the "catastrophic forgetting" problem in deep learning. Another example is research from a while ago showing that among the three gates of an LSTM, keeping only the "forget gate" is actually sufficient.

As for the long discussion on information bottleneck, it also corresponds to forgetting. Because the capacity of the brain is fixed, you have to complete your task with limited information, which extracts the valuable information. Take the classic example: a bank clerk might be able to identify counterfeit bills just by looking or touching them, but are they very knowledgeable about every detail of the currency? Can they draw the outline of the bank note perfectly? I don't think they can. Because the brain capacity they allocate to complete this task is limited, they only need the most important information to distinguish counterfeit currency. This is the information bottleneck of the brain.

The information bottleneck of deep learning mentioned earlier can be analogized in the same way. It is generally believed that the foundation of a neural network's effectiveness is information loss, losing (forgetting) useless information layer by layer, and finally retaining effective and generalized information. However, neural networks have far too many parameters, and sometimes they don't necessarily achieve this goal. So, Information Bottleneck adds a constraint to the neural network, equivalent to "compelling" the neural network to forget useless information. But because of this, VIB doesn't always improve the performance of your original model. If your model already "loses (forgets) useless information layer by layer and retains effective, generalized information," then VIB is redundant. VIB is just a regularization term; like all other regularization terms, its effect is not absolute.

I am suddenly reminded of a description of Zhang Wuji learning Taiji Sword in "The Heaven Sword and Dragon Saber":

You must know that what Zhang Sanfeng passed on to him was the "sword intent" rather than "sword moves." He had to forget almost every move he had seen until nothing was left; only then could he grasp the essence. In battle, one controls the sword with intent: a thousand changes, endless variations. If even one or two sword moves are not forgotten cleanly, the heart is constrained, and the swordsmanship cannot attain purity.

It turns out that forgetting is the highest level of mastery! Therefore, although this article may seem to have wandered off topic, it is a genuinely substantial essay: "On the Importance of Forgetting."

(Screenshot from "The Heaven Sword and Dragon Saber: Kung Fu Cult Master")