By 苏剑林 | July 18, 2018
Preface: I have loved pure mathematics since elementary school, and later developed an interest in physics as well, even studying theoretical physics for a while before gradually moving into machine learning after finishing my undergraduate studies. As a result, even in machine learning my research habits retain the style of mathematics and physics: trying to understand and derive as much as possible from the fewest principles. This article is one product of that philosophy: it takes variational inference as a starting point for a unified understanding of the various models in deep learning, especially the dazzling array of GANs. The paper has been posted to arXiv; readers who prefer the English manuscript can refer to "Variational Inference: A Unified Framework of Generative Models and Some Revelations".
Below is the introduction to the article. The Chinese version actually carries slightly richer information than the English one; please forgive my clumsy English...
Abstract: This article elucidates variational inference from a new perspective and shows that the EM algorithm, VAE, GAN, AAE, and ALI (BiGAN) can all be viewed as special cases of variational inference. Along the way, it also shows that the optimization objective of the standard GAN is incomplete, which explains why GAN training requires careful tuning of various hyperparameters. Finally, the article derives a regularization term that remedies this incompleteness, and experiments show that it improves the stability of GAN training.
In recent years, deep generative models, especially GANs, have achieved great success; dozens or even hundreds of GAN variants can now be found. However, most of them are empirically motivated improvements, and very few are guided by a reasonably complete theory.
The goal of this article is to establish a unified framework for these generative models through variational inference. First, this article introduces a new form of variational inference, which has actually been introduced in previous blog posts; it allows us to derive the Variational Autoencoder (VAE) and the EM algorithm within a few lines. Then, using this new form, we can directly derive GAN and discover that the standard GAN loss is actually incomplete, lacking a regularization term. Without this regularization term, we need to carefully adjust hyperparameters to make the model converge.
In fact, the original intention of this work was to incorporate GAN into the framework of variational inference. Currently, the initial goal has been achieved, and the results are gratifying. The newly derived regularization term is actually a byproduct, and fortunately, this byproduct worked in our experiments.
A New Interpretation of Variational Inference
Suppose $x$ is an observable variable, $z$ is a latent variable, $\tilde{p}(x)$ is the evidence distribution of $x$, and we have
\begin{equation}q(x)=q_{\theta}(x)=\int q_{\theta}(x,z)dz\end{equation}
We hope $q_{\theta}(x)$ can approximate $\tilde{p}(x)$, so in general, we would maximize the likelihood function
\begin{equation}\theta = \mathop{\text{argmax}}_{\theta}\, \int \tilde{p}(x)\log q(x) dx\end{equation}
This is also equivalent to minimizing the KL divergence $KL(\tilde{p}(x)\Vert q(x))$, since the two objectives differ only by the entropy of $\tilde{p}(x)$, which is a constant:
\begin{equation}KL(\tilde{p}(x)\Vert q(x)) = \int \tilde{p}(x) \log \frac{\tilde{p}(x)}{q(x)}dx\end{equation}
However, since the integral $q(x)=\int q_{\theta}(x,z)dz$ is often intractable, this objective is in most cases difficult to optimize directly.
In variational inference, a joint distribution $p(x,z)$ is first introduced such that $\tilde{p}(x)=\int p(x,z)dz$. The essence of variational inference is to change the KL divergence of the marginal distribution $KL(\tilde{p}(x)\Vert q(x))$ to the KL divergence of the joint distribution $KL(p(x,z)\Vert q(x,z))$ or $KL(q(x,z)\Vert p(x,z))$. Since
\begin{equation}
\begin{aligned}
KL(p(x,z)\Vert q(x,z)) &= KL(\tilde{p}(x)\Vert q(x)) + \int \tilde{p}(x) KL(p(z|x)\Vert q(z|x)) dx \\
&\geq KL(\tilde{p}(x)\Vert q(x))
\end{aligned}
\end{equation}
this means that the KL divergence of the joint distribution is a stronger condition (an upper bound). Therefore, once optimization is successful, we get $q(x,z)\to p(x,z)$, and thus $\int q(x,z)dz \to \int p(x,z)dz = \tilde{p}(x)$, meaning $\int q(x,z)dz$ becomes an approximation of the true distribution $\tilde{p}(x)$.
Of course, we are not strengthening the condition for the sake of strengthening it, but because in many cases, $KL(p(x,z)\Vert q(x,z))$ or $KL(q(x,z)\Vert p(x,z))$ is often easier to calculate than $KL(\tilde{p}(x)\Vert q(x))$. Thus, variational inference provides a computable alternative.
VAE and the EM Algorithm
From the aforementioned new understanding of variational inference, we can derive two basic results within a few sentences: Variational Autoencoders and the EM algorithm. This part of the content has actually been introduced in detail in "From Maximum Likelihood to EM Algorithm: A Consistent Way of Understanding" and "Variational Autoencoder (2): From a Bayesian Perspective". I will mention them briefly here.
VAE
In VAE, we set $q(x,z)=q(x|z)q(z)$ and $p(x,z)=\tilde{p}(x) p(z|x)$, where $q(x|z)$ and $p(z|x)$ are Gaussian distributions with unknown parameters, and $q(z)$ is a standard Gaussian distribution. The goal of minimization is
\begin{equation}\label{eq:kl-oo}KL\left(p(x,z)\Vert q(x,z) \right)=\iint \tilde{p}(x) p(z|x) \log \frac{\tilde{p}(x) p(z|x)}{q(x|z)q(z)}dxdz\end{equation}
where $\log \tilde{p}(x)$ involves none of the parameters being optimized and can be treated as a constant, while the integral over $\tilde{p}(x)$ becomes sampling over data points, yielding:
\begin{equation}\mathbb{E}_{x\sim \tilde{p}(x)}\left[-\int p(z|x)\log q(x|z)dz + KL(p(z|x)\Vert q(z))\right]\end{equation}
Since $q(x|z)$ and $p(z|x)$ are Gaussian distributions implemented with neural networks, $KL\left(p(z|x)\Vert q(z)\right)$ can be computed in closed form. Estimating the remaining integral $\int p(z|x) \log q(x|z)dz$ with a single reparameterized sample $z\sim p(z|x)$, we obtain the final VAE loss to be minimized:
\begin{equation}\mathbb{E}_{x\sim \tilde{p}(x)}\Big[-\log q(x|z) + KL(p(z|x)\Vert q(z))\Big]\end{equation}
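To make the two terms concrete, here is a minimal PyTorch sketch of this loss. It assumes a diagonal-Gaussian $p(z|x)$ produced by an encoder and a fixed-variance Gaussian $q(x|z)$, so that $-\log q(x|z)$ reduces to a squared error up to constants; all layer sizes and names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=64, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)       # mean of p(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)   # log-variance of p(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def loss(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: a single sample z ~ p(z|x) estimates the integral
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # -log q(x|z): squared error up to constants (fixed-variance Gaussian)
        recon = ((self.dec(z) - x) ** 2).sum(dim=1)
        # KL(p(z|x) || q(z)): closed form for two diagonal Gaussians
        kl = 0.5 * (mu ** 2 + logvar.exp() - 1 - logvar).sum(dim=1)
        return (recon + kl).mean()
```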
EM Algorithm
In VAE, we constrained the posterior distribution $p(z|x)$ to be Gaussian and optimized its parameters. Without this assumption, directly optimizing the original objective $\eqref{eq:kl-oo}$ is still feasible in some cases, but then we can only optimize alternately: first fix $p(z|x)$ and optimize $q(x|z)$, which gives
\begin{equation}\label{eq:em-1}q(x|z) = \mathop{\text{argmax}}_{q(x|z)} \,\mathbb{E}_{x\sim \tilde{p}(x)}\left[\int p(z|x) \log q(x,z) dz\right]\end{equation}
After completing this step, we fix $q(x,z)$ and optimize $p(z|x)$. First, write $q(x|z)q(z)$ in the form $q(z|x)q(x)$:
\begin{equation}q(x)=\int q(x|z)q(z)dz,\quad q(z|x)=\frac{q(x|z)q(z)}{q(x)}\end{equation}
Then we have
\begin{equation}
\begin{aligned}
p(z|x) =& \mathop{\text{argmin}}_{p(z|x)} \,\mathbb{E}_{x\sim \tilde{p}(x)}\left[\int p(z|x) \log \frac{p(z|x)}{q(z|x)q(x)} dz\right] \\
=& \mathop{\text{argmin}}_{p(z|x)} \,\mathbb{E}_{x\sim \tilde{p}(x)}\left[KL\left(p(z|x)\Vert q(z|x)\right)-\log q(x)\right] \\
=& \mathop{\text{argmin}}_{p(z|x)} \,\mathbb{E}_{x\sim \tilde{p}(x)} \left[KL\left(p(z|x)\Vert q(z|x)\right)\right]
\end{aligned}
\end{equation}
Since there are currently no constraints on $p(z|x)$, we can directly set $p(z|x)=q(z|x)$ to make the loss equal to 0. In other words, $p(z|x)$ has a theoretical optimal solution:
\begin{equation}\label{eq:em-2}p(z|x) = \frac{q(x|z)q(z)}{\int q(x|z)q(z)dz}\end{equation}
Alternating between $\eqref{eq:em-1}$ and $\eqref{eq:em-2}$ constitutes the EM algorithm: $\eqref{eq:em-2}$ is the E-step (computing the exact posterior) and $\eqref{eq:em-1}$ is the M-step. In this way, the EM algorithm falls out of the variational inference framework almost immediately.
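As a concrete instance, here is a minimal NumPy sketch of the two alternating steps for a one-dimensional Gaussian mixture, where $z$ is discrete so the integrals become sums; all names are illustrative.

```python
import numpy as np

def em_gmm(x, k=2, iters=100):
    """EM for a 1-D Gaussian mixture: discrete z, q(x|z) = N(mu_z, var_z), q(z) = pi."""
    n = len(x)
    mu = np.random.choice(x, k)                  # initial component means
    var, pi = np.ones(k), np.ones(k) / k
    for _ in range(iters):
        # E-step, eq (em-2): p(z|x) = q(x|z) q(z) / sum_z q(x|z) q(z)
        lik = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        post = lik * pi
        post /= post.sum(axis=1, keepdims=True)  # responsibilities, shape (n, k)
        # M-step, eq (em-1): maximize E_{p(z|x)}[log q(x, z)] over mu, var, pi
        nk = post.sum(axis=0)
        mu = (post * x[:, None]).sum(axis=0) / nk
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / n
    return mu, var, pi
```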
GAN under Variational Inference
In this part, we introduce a generalized method to incorporate GAN into variational inference, which will lead us to a new understanding of GAN and an effective regularization term.
General Framework
Like VAE, GAN also aims to train a generative model $q(x|z)$ that maps $q(z)=N(z;0,I)$ to the dataset distribution $\tilde{p}(x)$. Whereas VAE chooses $q(x|z)$ to be Gaussian, GAN chooses
\begin{equation}q(x|z)=\delta\left(x - G(z)\right),\quad q(x)=\int q(x|z)q(z)dz\end{equation}
where $\delta(x)$ is the Dirac delta function and $G(z)$ is the neural network of the generator.
Generally, we would treat $z$ as a latent variable. However, the delta function makes $q(x|z)$ a point distribution, so the map from $z$ to $x$ is effectively deterministic (one-to-one) and "not random enough". For this reason, in GAN we do not treat $z$ as a latent variable, and we need not consider a posterior distribution $p(z|x)$.
In fact, in GAN, only a binary latent variable $y$ is introduced to construct the joint distribution
\begin{equation}q(x,y)=\left\{\begin{aligned}&\tilde{p}(x)p_1,\,y=1\\&q(x)p_0,\,y=0\end{aligned}\right.\end{equation}
Here $\{p_1, p_0\}$ with $p_1 = 1-p_0$ is a fixed Bernoulli prior; we simply take $p_1=p_0=1/2$. On the other hand, we set $p(x,y)=p(y|x) \tilde{p}(x)$, where $p(y|x)$ is a conditional Bernoulli distribution. The optimization goal is the KL divergence in the other direction, $KL\left(q(x,y)\Vert p(x,y) \right)$ (below, "$\sim$" denotes equality up to additive constants):
\begin{equation}
\begin{aligned}
KL\left(q(x,y)\Vert p(x,y) \right)=&\int \tilde{p}(x)p_1\log \frac{\tilde{p}(x)p_1}{p(1|x)\tilde{p}(x)}dx+\int q(x)p_0\log \frac{q(x)p_0}{p(0|x)\tilde{p}(x)}dx \\
\sim&\int \tilde{p}(x)\log \frac{1}{p(1|x)}dx+\int q(x)\log \frac{q(x)}{p(0|x)\tilde{p}(x)}dx
\end{aligned}
\end{equation}
Once successfully optimized, we have $q(x,y)\to p(x,y)$, then
\begin{equation}p_1 \tilde{p}(x) + p_0 q(x) = \sum_y q(x,y) \to \sum_y p(x,y) = \tilde{p}(x)\end{equation}
Thus $q(x)\to\tilde{p}(x)$, completing the construction of the generative model.
Our optimization targets are now $p(y|x)$ and $G(z)$. Write $p(1|x)=D(x)$; this is the discriminator. As with the EM algorithm, we optimize alternately: first fix $G(z)$, which fixes $q(x)$, and optimize $p(y|x)$. Omitting constants, the objective is:
\begin{equation}D = \mathop{\text{argmin}}_{D} -\mathbb{E}_{x\sim\tilde{p}(x)}\left[\log D(x)\right]-\mathbb{E}_{x\sim q(x)}\left[\log (1-D(x))\right]\end{equation}
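This is just the familiar binary cross-entropy objective for the discriminator. A minimal PyTorch sketch of this step, assuming $D$ ends in a sigmoid (outputs probabilities) and that the models and optimizer are defined elsewhere; names are illustrative:

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, G, real_x, d_opt, z_dim=128):
    """One update of p(1|x) = D(x) with G(z), and hence q(x), held fixed."""
    z = torch.randn(real_x.size(0), z_dim)       # z ~ q(z) = N(0, I)
    fake_x = G(z).detach()                       # detach: no gradient into G
    d_real, d_fake = D(real_x), D(fake_x)
    # -E_{x~p~}[log D(x)] - E_{x~q}[log(1 - D(x))]
    loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
           F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    d_opt.zero_grad(); loss.backward(); d_opt.step()
    return loss.item()
```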
Then fix $D(x)$ to optimize $G(z)$. At this time, the related loss is:
\begin{equation}\label{eq:gan-g-loss}G = \mathop{\text{argmin}}_{G}\int q(x)\log \frac{q(x)}{(1-D(x)) \tilde{p}(x)}dx\end{equation}
This contains $\tilde{p}(x)$, which we do not know. However, if $D(x)$ has sufficient fitting capacity, then by the same logic as $\eqref{eq:em-2}$, its optimal solution is
\begin{equation}D(x)=\frac{\tilde{p}(x)}{\tilde{p}(x)+q^{o}(x)}\end{equation}
where $q^{o}(x)$ is the $q(x)$ from the previous stage. Solving for $\tilde{p}(x)$ and substituting it into $\eqref{eq:gan-g-loss}$ gives
\begin{equation}
\begin{aligned}
\int q(x)\log \frac{q(x)}{D(x) q^{o}(x)}dx=&-\mathbb{E}_{x\sim q(x)}\log D(x) + KL\left(q(x)\Vert q^{o}(x)\right) \\
=&-\mathbb{E}_{z\sim q(z)}\log D(G(z)) + KL\left(q(x)\Vert q^{o}(x)\right)
\end{aligned}
\end{equation}
Basic Analysis
As can be seen, the first term is exactly one of the standard GAN generator losses (the "non-saturating" $-\log D$ form):
\begin{equation}-\mathbb{E}_{z\sim q(z)}\log D(G(z))\end{equation}
The extra second term measures the distance between the new distribution and the old one. These two terms are adversarial: $KL\left(q(x)\Vert q^{o}(x)\right)$ wants the new distribution to stay consistent with the old, but if the discriminator is trained to optimality, $D(x)$ is very small on samples from the old distribution $q^{o}(x)$ (almost all are classified as negative samples), so $-\log D(x)$ becomes quite large, and vice versa. Optimizing the whole loss jointly, the model must both "inherit" the old distribution $q^{o}(x)$ and explore the new direction indicated by $p(1|x)$, interpolating between old and new.
We know that the standard GAN generator loss does not include $KL\left(q(x)\Vert q^{o}(x)\right)$, which makes the loss incomplete. Suppose an optimization algorithm could always find the theoretical optimum of $G(z)$, and $G(z)$ had unlimited fitting capacity: then $G(z)$ would only need to output the single point that maximizes $D(x)$, regardless of the input $z$, which is exactly mode collapse. In this sense, without the missing term, mode collapse is theoretically inevitable.
So, what does $KL\left(q(x)\Vert q^{o}(x)\right)$ inspire us to do? We set
\begin{equation}q^{o}(x)=q_{\theta-\Delta \theta}(x),\quad q(x)=q_{\theta}(x)\end{equation}
That is to say, assuming the current model's parameter change is $\Delta\theta$, expanding to the second order gives
\begin{equation}KL\left(q(x)\Vert q^{o}(x)\right)\approx \int\frac{\left(\Delta\theta\cdot \nabla_{\theta}q_{\theta}(x)\right)^2}{2q_{\theta}(x)} dx \approx \left(\Delta\theta\cdot c\right)^2\end{equation}
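In more detail: writing $q^{o}(x)=q_{\theta-\Delta\theta}(x) \approx q_{\theta}(x) - \Delta\theta\cdot\nabla_{\theta}q_{\theta}(x)$ and using $-\log(1-u)\approx u + u^2/2$, we have
\begin{equation}\int q_{\theta}(x)\log\frac{q_{\theta}(x)}{q^{o}(x)}dx \approx \int q_{\theta}(x)\left[u(x)+\frac{u(x)^2}{2}\right]dx,\qquad u(x)=\frac{\Delta\theta\cdot\nabla_{\theta}q_{\theta}(x)}{q_{\theta}(x)}\end{equation}
The first-order term integrates to $\Delta\theta\cdot\nabla_{\theta}\int q_{\theta}(x)dx=0$, leaving exactly the quadratic (Fisher-information) form above.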
We have pointed out that a complete GAN generator loss should include $KL\left(q(x)\Vert q^{o}(x)\right)$; if it is not included, its effect must be achieved indirectly. The approximation above shows that the missing loss is about $\left(\Delta\theta\cdot c\right)^2$, which we must keep from growing too large, that is, we must keep $\Delta\theta$ small (within each stage, $c$ can be treated as roughly constant). Since we use gradient-based optimizers, $\Delta\theta$ is proportional to the gradient, so many tricks in standard GAN training, such as gradient clipping, using the Adam optimizer, and using Batch Normalization (BN), can be explained: they all stabilize the gradient so that $\Delta\theta$ stays small. Likewise, the number of generator iterations per stage should not be too large, since too many iterations also lead to an excessive $\Delta\theta$.
Moreover, this part of the analysis only applies to the generator, while the discriminator itself is not constrained, so the discriminator can be trained to optimality.
Regularization Term
Now we extract something truly useful from this: a direct estimate of $KL\left(q(x)\Vert q^{o}(x)\right)$, yielding a regularization term usable in actual training. Computing it directly is difficult, but we can estimate it via the joint KL divergence $KL\left(q(x,z)\Vert q^{o}(x,z)\right)$, which upper-bounds it by the same argument as before:
\begin{equation}
\begin{aligned}
KL\left(q(x,z)\Vert q^{o}(x,z)\right)=&\iint q(x|z)q(z)\log \frac{q(x|z)q(z)}{q^{o}(x|z)q(z)}dxdz \\
=&\iint \delta\left(x-G(z)\right)q(z)\log \frac{\delta\left(x-G(z)\right)}{\delta\left(x-G^{o}(z)\right)}dxdz \\
=&\int q(z)\log \frac{\delta(0)}{\delta\left(G(z)-G^{o}(z)\right)}dz
\end{aligned}
\end{equation}
Because there is a limit
\begin{equation}\delta(x)=\lim_{\sigma\to 0}\frac{1}{(2\pi\sigma^2)^{d/2}}\exp\left(-\frac{\Vert x\Vert^2}{2\sigma^2}\right)\end{equation}
we can consider $\delta(x)$ as a Gaussian distribution with small variance. Substituting it into the calculation, we have
\begin{equation}KL\left(q(x)\Vert q^{o}(x)\right)\sim \lambda \int q(z)\Vert G(z) - G^{o}(z)\Vert^2 dz\end{equation}
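Concretely, treating $\delta$ as $N(0,\sigma^2 I)$ with a small fixed $\sigma$, the normalizing constants cancel in the log-ratio of the integrand above:
\begin{equation}\log\frac{\delta(0)}{\delta\left(G(z)-G^{o}(z)\right)} = \frac{\Vert G(z)-G^{o}(z)\Vert^2}{2\sigma^2}\end{equation}
so $\lambda$ plays the role of $1/(2\sigma^2)$ and is treated as a tunable hyperparameter.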
Therefore, the complete generator loss can be chosen as
\begin{equation}\label{eq:reg-loss}\mathbb{E}_{z\sim q(z)}\left[-\log D(G(z))+\lambda \Vert G(z) - G^{o}(z)\Vert^2\right] \end{equation}
In other words, the distance between the new and old generated samples can be used as a regularization term. The regularization term ensures that the model does not deviate too far from the old distribution.
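As a sketch of how $\eqref{eq:reg-loss}$ might look in training code, here is a minimal PyTorch generator phase. `G_old` snapshots the generator at the start of the phase (my reading of $G^{o}$); the snapshot frequency, the value $\lambda=0.1$, and all names are illustrative choices, not the paper's implementation:

```python
import copy
import torch

def generator_phase(G, D, g_opt, n_updates=2, batch_size=64, z_dim=128, lam=0.1):
    """Generator updates with the extra ||G(z) - G^o(z)||^2 regularizer of eq (reg-loss)."""
    G_old = copy.deepcopy(G).eval()              # q^o(x): frozen old generator
    for p in G_old.parameters():
        p.requires_grad_(False)
    for _ in range(n_updates):
        z = torch.randn(batch_size, z_dim)       # z ~ q(z)
        fake_x = G(z)
        adv = -torch.log(D(fake_x) + 1e-8).mean()    # -E log D(G(z))
        with torch.no_grad():
            old_x = G_old(z)                     # G^o(z) on the same noise
        reg = lam * ((fake_x - old_x) ** 2).flatten(1).sum(dim=1).mean()
        loss = adv + reg
        g_opt.zero_grad(); loss.backward(); g_opt.step()
```

Note that the penalty is inactive on the very first update after the snapshot (the quadratic has zero gradient at $G=G^{o}$), so it matters precisely when the generator takes several steps per phase.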
The following two experiments on the face data CelebA show that this regularization term is effective. The experimental code is modified from here and is currently available on my GitHub.
Experiment 1: Ordinary DCGAN network, each iteration trains the generator and discriminator for one batch.
Without the regularization term, the model starts to collapse after 25 epochs.
With the regularization term, the model can be trained stably all the time.
Experiment 2: Ordinary DCGAN network but with BN removed, each iteration trains the generator and discriminator for five batches.
Without the regularization term, the model's convergence speed is relatively slow.
With the regularization term, the model "gets on track" faster.
GAN-Related Models
Adversarial Autoencoders (AAE) and Adversarially Learned Inference (ALI) are variants of GAN, and they too can be incorporated into variational inference. After the preparations above, this amounts to little more than a couple of exercises.
Interestingly, for ALI we obtain a somewhat counter-intuitive result.
AAE from a GAN Perspective
In fact, swapping the roles of $x$ and $z$ in the GAN discussion yields the AAE framework.
Specifically, AAE aims to train an encoding model $p(z|x)$ that maps the true distribution $\tilde{p}(x)$ to the standard Gaussian distribution $q(z)=N(z;0,I)$, where
\begin{equation}p(z|x)=\delta\left(z - E(x)\right),\quad p(z)=\int p(z|x)\tilde{p}(x)dx\end{equation}
where $E(x)$ is the neural network of the encoder.
Like GAN, AAE introduces a binary latent variable $y$, and has
\begin{equation}p(z,y)=\left\{\begin{aligned}&p(z)p_1,\,y=1\\&q(z)p_0,\,y=0\end{aligned}\right.\end{equation}
Similarly, we take $p_1=p_0=1/2$ directly. On the other hand, we set $q(z,y)=q(y|z) q(z)$, where the posterior $q(y|z)$ is a conditional Bernoulli distribution with input $z$, and then optimize $KL\left(p(z,y)\Vert q(z,y) \right)$:
\begin{equation}
\begin{aligned}
KL\left(p(z,y)\Vert q(z,y) \right)=&\int p(z)p_1\log \frac{p(z)p_1}{q(1|z)q(z)}dz+\int q(z)p_0\log \frac{q(z)p_0}{q(0|z)q(z)}dz \\
\sim&\int p(z)\log \frac{p(z)}{q(1|z)q(z)}dz+\int q(z)\log \frac{1}{q(0|z)}dz
\end{aligned}
\end{equation}
Now the optimization targets are $q(y|z)$ and $E(x)$. Let $q(0|z)=D(z)$, and optimize alternately: first fix $E(x)$, which fixes $p(z)$, and optimize $q(y|z)$. Omitting constants, the objective is:
\begin{equation}
\begin{aligned}
D=\mathop{\text{argmin}}_D &-\mathbb{E}_{z\sim p(z)}\left[\log (1-D(z))\right]-\mathbb{E}_{z\sim q(z)}\left[\log D(z)\right] \\
=\mathop{\text{argmin}}_D &-\mathbb{E}_{x\sim \tilde{p}(x)}\left[\log (1-D(E(x)))\right]-\mathbb{E}_{z\sim q(z)}\left[\log D(z)\right]
\end{aligned}
\end{equation}
Then fix $D(z)$ to optimize $E(x)$. At this time, the related loss is:
\begin{equation}E = \mathop{\text{argmin}}_E \int p(z)\log \frac{p(z) }{(1-D(z)) q(z)}dz\end{equation}
Using the theoretical optimal solution for $D(z)$, $D(z)=q(z)/[p^{o}(z)+q(z)]$, and substituting it into the loss, we get
\begin{equation}\mathbb{E}_{x\sim \tilde{p}(x)}[-\log D(E(x))] + KL\left(p(z)\Vert p^{o}(z)\right)\end{equation}
On the one hand, just as with standard GAN, careful (small-step) training lets us drop the second term, leaving
\begin{equation}\mathbb{E}_{x\sim \tilde{p}(x)}[-\log D(E(x))]\end{equation}
On the other hand, we could train a decoder $G(z)$ after obtaining the encoder. But if $E(x)$ and $G(z)$ have sufficient fitting capacity, the reconstruction error can be made small enough that adding the reconstruction term to the loss above does not interfere with the GAN training, so the two can be trained jointly:
\begin{equation}G,E = \mathop{\text{argmin}}_{G,E}\mathbb{E}_{x\sim \tilde{p}(x)}\left[-\log D(E(x))+\lambda\Vert x - G(E(x))\Vert^2\right]\end{equation}
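A minimal PyTorch sketch of this joint encoder/decoder step, with the latent-space discriminator $D(z)=q(0|z)$ held fixed and assumed to output a probability; all names and $\lambda=1$ are illustrative:

```python
import torch

def aae_joint_step(E, G, D, opt, real_x, lam=1.0):
    """Joint update of encoder E and decoder G; the latent discriminator D is fixed."""
    z = E(real_x)
    # -log D(E(x)): push encoded z toward what D classifies as the prior q(z)
    adv = -torch.log(D(z) + 1e-8).mean()
    # lambda * ||x - G(E(x))||^2: reconstruction error of the decoder
    rec = lam * ((real_x - G(z)) ** 2).flatten(1).sum(dim=1).mean()
    loss = adv + rec
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```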
Counter-intuitive ALI Version
ALI is like a fusion of GAN and AAE; another almost identical work is Bidirectional GAN (BiGAN). Compared to GAN, it also incorporates $z$ as a latent variable into variational inference. Specifically, in ALI we have
\begin{equation}q(x,z,y)=\left\{\begin{aligned}&p(z|x)\tilde{p}(x) p_1,\,y=1\\&q(x|z)q(z)p_0,\,y=0\end{aligned}\right.\end{equation}
and $p(x,z,y)=p(y|x,z) p(z|x) \tilde{p}(x)$, and then optimize $KL\left(q(x,z,y)\Vert p(x,z,y) \right)$:
\begin{equation}
\begin{aligned}
&\iint p(z|x)\tilde{p}(x) p_1\log \frac{p(z|x)\tilde{p}(x) p_1}{p(1|x,z) p(z|x) \tilde{p}(x)}dxdz \\
+&\iint q(x|z)q(z)p_0\log \frac{q(x|z)q(z)p_0}{p(0|x,z) p(z|x) \tilde{p}(x)}dxdz
\end{aligned}
\end{equation}
which is equivalent to minimizing
\begin{equation}\label{eq: ori-loss-ali}\iint p(z|x)\tilde{p}(x)\log \frac{1}{p(1|x,z)}dxdz+\iint q(x|z)q(z)\log \frac{q(x|z)q(z)}{p(0|x,z) p(z|x) \tilde{p}(x)}dxdz\end{equation}
Now the optimization targets are $p(y|x,z)$, $p(z|x)$, and $q(x|z)$. Let $p(1|x,z)=D(x,z)$; $p(z|x)$ is a Gaussian (or Dirac) distribution given by an encoder $E$, and $q(x|z)$ is a Gaussian (or Dirac) distribution given by a generator $G$. We again optimize alternately: first fix $E,G$; the loss relevant to $D$ is then
\begin{equation}D=\mathop{\text{argmin}}_D -\mathbb{E}_{x\sim\tilde{p}(x),z\sim p(z|x)} \log D(x,z) - \mathbb{E}_{z\sim q(z),x\sim q(x|z)} \log (1-D(x,z))\end{equation}
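A minimal PyTorch sketch of this joint discriminator step, taking the deterministic (Dirac) choice for both $E$ and $G$ so that no extra reparameterization noise is needed; $D$ takes the pair $(x,z)$ and outputs a probability, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def ali_discriminator_step(D, E, G, d_opt, real_x, z_dim=128):
    """One update of D(x, z) with E and G held fixed."""
    z_real = E(real_x).detach()                  # (x, z) ~ p(z|x) p~(x)
    z_fake = torch.randn(real_x.size(0), z_dim)  # z ~ q(z)
    x_fake = G(z_fake).detach()                  # (x, z) ~ q(x|z) q(z)
    d_real, d_fake = D(real_x, z_real), D(x_fake, z_fake)
    loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
           F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    d_opt.zero_grad(); loss.backward(); d_opt.step()
    return loss.item()
```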
As in VAE, the expectations over $p(z|x)$ and $q(x|z)$ can be computed via the reparameterization trick. Next, fix $D$ and optimize $G,E$. Since both $E$ and $G$ now appear, the loss cannot be simplified and remains as in $\eqref{eq: ori-loss-ali}$. But substituting the optimal solution for $D$,
\begin{equation}D(x,z)=\frac{p^{o}(z|x)\tilde{p}(x)}{p^{o}(z|x)\tilde{p}(x)+q^{o}(x|z)q(z)}\end{equation}
it can be transformed into
\begin{equation}
\begin{aligned}
-\iint p(z|x)\tilde{p}(x)\log D(x, z) dxdz &-\iint q(x|z) q(z)\log D(x, z) dxdz \\
+\int q(z) KL(q(x|z)\Vert q^o(x|z)) dz &+ \iint q(x|z) q(z)\log \frac{p^o(z|x)}{p(z|x)}dxdz
\end{aligned}
\end{equation}
Since $q(x|z)$ and $p(z|x)$ are both Gaussian, the last two terms can actually be computed concretely (in combination with the reparameterization trick). But just as in standard GAN, with careful training we can simply drop the last two terms, obtaining
\begin{equation}\label{eq:our-ali-g}-\iint p(z|x)\tilde{p}(x)\log D(x, z) dxdz -\iint q(x|z) q(z)\log D(x, z) dxdz\end{equation}
This is the derived ALI generator and encoder loss, which is different from the standard ALI result. Standard ALI (including ordinary GAN) views it as a minimax problem, so the generator and encoder losses are
\begin{equation}\label{eq:our-ali-g-o1}\iint p(z|x)\tilde{p}(x)\log D(x, z) dxdz + \iint q(x|z) q(z)\log (1-D(x, z)) dxdz\end{equation}
or
\begin{equation}\label{eq:our-ali-g-o2}-\iint p(z|x)\tilde{p}(x)\log (1-D(x, z)) dxdz - \iint q(x|z) q(z)\log D(x, z) dxdz\end{equation}
Neither is equivalent to $\eqref{eq:our-ali-g}$. Regarding this difference, I also ran experiments; the results showed that the ALI variant derived here performs the same as standard ALI, or perhaps even slightly better (possibly my own favorable bias, so no figures are included). This suggests that viewing adversarial networks as minimax problems is merely an intuitive choice, not a necessity.
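For concreteness, the three candidate generator/encoder losses, $\eqref{eq:our-ali-g}$, $\eqref{eq:our-ali-g-o1}$, and $\eqref{eq:our-ali-g-o2}$, differ only in which log term each $(x,z)$ pair contributes. A minimal PyTorch sketch under the same deterministic sampling as above; the variant names are mine:

```python
import torch

def ali_ge_loss(D, E, G, real_x, z_dim=128, variant="derived"):
    """Generator/encoder loss: eq (our-ali-g) versus the two standard ALI forms."""
    z_real = E(real_x)                           # gradients flow into E
    z_fake = torch.randn(real_x.size(0), z_dim)
    x_fake = G(z_fake)                           # gradients flow into G
    d_real, d_fake = D(real_x, z_real), D(x_fake, z_fake)
    eps = 1e-8
    if variant == "derived":    # eq (our-ali-g): -E[log D] on both pairs
        return -(d_real + eps).log().mean() - (d_fake + eps).log().mean()
    if variant == "minimax":    # eq (our-ali-g-o1)
        return (d_real + eps).log().mean() + (1 - d_fake + eps).log().mean()
    # non-saturating variant, eq (our-ali-g-o2)
    return -(1 - d_real + eps).log().mean() - (d_fake + eps).log().mean()
```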
Summary
The results of this article show that variational inference is indeed a unifying framework for deriving and interpreting generative models, including VAE and GAN. Through a new interpretation of variational inference, we have shown how it achieves this goal.
Of course, this is not the first article to study GAN through variational inference. In "On Unifying Deep Generative Models", the authors also attempted to unify VAE and GAN under variational inference and obtained some enlightening results, but I feel the treatment there was not clear enough. In fact, I did not fully understand that article, and I am not sure whether it incorporates GAN into variational inference or VAE into GAN. By comparison, I find the argument in this article clearer and more explicit.
It seems there is still plenty of room in variational inference waiting to be explored.