By 苏剑林 | May 10, 2019
Conditional generation results on ImageNet (128x128) using the model discussed in this article.
The results to be introduced today are still related to energy-based models, originating from the paper "Implicit Generation and Generalization in Energy-Based Models". Of course, it is no longer strictly related to GANs, but since it shares a strong connection with the energy models introduced in the second post of this series, I decided to include it in this series as well.
I originally noticed this paper through a report from JiQiZhiXin titled "MIT Undergraduate God Reboots Energy-Based Generative Models; New Framework Rivals GANs". To be honest, the paper may strike some readers as a "reinvention of the wheel," and the headline's choice of the word "reboot" is rather apt. The paper points out that an energy-based model is essentially the stationary distribution of a particular Langevin equation, and then uses that Langevin equation to perform sampling; once a sampling procedure is available, the energy model can be trained. These ideas have been around for a long time; I had similar thoughts while studying stochastic differential equations, and I believe many others have as well. Therefore, in my view, the authors' real contribution lies in making this straightforward idea work through a series of "alchemy" (careful tuning) tricks.
Nonetheless, being able to train such a model successfully is a significant achievement. Furthermore, for readers who haven't encountered this topic before, it serves as an excellent case study of energy-based models. Thus, I will summarize the overall logic of the paper to help readers gain a more comprehensive understanding of energy models.
Energy Distribution
Similar to "GAN Models from the Perspective of Energy (Part 2): GAN = 'Analysis' + 'Sampling'", suppose we have a set of data $x_1, x_2, \dots, x_n \sim p(x)$. We wish to fit it using a probabilistic model, which we define as:
\begin{equation}q_{\theta}(x) = \frac{e^{-U_{\theta}(x)}}{Z_{\theta}}\end{equation}
where $U_{\theta}$ is an undetermined function with parameters $\theta$, called the "energy function," and $Z_{\theta}$ is the normalization factor (partition function):
\begin{equation}Z_{\theta} = \int e^{-U_{\theta}(x)}dx\label{eq:z}\end{equation}
Such a distribution is called an "energy distribution," or a "Boltzmann distribution" in physics.
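As a quick sanity check (my own toy example, not one from the paper): taking a one-dimensional energy $U(x)=\frac{1}{2}x^2$ gives
\begin{equation}Z=\int e^{-x^2/2}dx=\sqrt{2\pi},\qquad q(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2},\end{equation}
i.e. the standard normal distribution. The energy function determines the shape of the distribution, while $Z_{\theta}$ merely normalizes it.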
To find the parameters $\theta$, we first define the log-likelihood function:
\begin{equation}\mathbb{E}_{x\sim p(x)} \big[\log q_{\theta}(x)\big]\end{equation}
We want to maximize this, which is equivalent to minimizing:
\begin{equation}L_{\theta}=\mathbb{E}_{x\sim p(x)} \big[-\log q_{\theta}(x)\big]\end{equation}
To do this, we use gradient descent on $L_{\theta}$. We have (refer to the second post for the derivation):
\begin{equation}\nabla_{\theta}\log q_{\theta}(x)=-\nabla_{\theta} U_{\theta}(x)+\mathbb{E}_{x\sim q_{\theta}(x)}\big[\nabla_{\theta} U_{\theta}(x)\big]\end{equation}
So,
\begin{equation}\nabla_{\theta} L_{\theta} = \mathbb{E}_{x\sim p(x)}\big[\nabla_{\theta} U_{\theta}(x)\big] - \mathbb{E}_{x\sim q_{\theta}(x)}\big[\nabla_{\theta} U_{\theta}(x)\big]\label{eq:q-grad}\end{equation}
This means the gradient descent update formula is:
\begin{equation}\theta \leftarrow \theta - \varepsilon \Big(\mathbb{E}_{x\sim p(x)}\big[\nabla_{\theta} U_{\theta}(x)\big] - \mathbb{E}_{x\sim q_{\theta}(x)}\big[\nabla_{\theta} U_{\theta}(x)\big]\Big)\end{equation}
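In code, this boils down to gradient descent on a simple surrogate loss. Here is a minimal PyTorch sketch (my own; `energy_net` is a hypothetical module mapping a batch of inputs to per-sample energies), which assumes a batch of fake samples drawn from $q_{\theta}(x)$ is already available; obtaining such samples is exactly the problem addressed in the next section:

```python
import torch

def ebm_surrogate_loss(energy_net, x_real, x_fake):
    """mean U_theta(real) - mean U_theta(fake): its gradient with respect
    to the network parameters is a Monte-Carlo estimate of the gradient
    formula above. x_fake is detached so gradients flow only through U_theta."""
    u_real = energy_net(x_real).mean()           # estimates E_{x~p(x)}[U_theta(x)]
    u_fake = energy_net(x_fake.detach()).mean()  # estimates E_{x~q_theta(x)}[U_theta(x)]
    return u_real - u_fake
```

Calling `.backward()` on this loss and taking one optimizer step reproduces the update formula above.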
Langevin Equation
In Equation $\eqref{eq:q-grad}$, $\mathbb{E}_{x\sim p(x)}\big[\nabla_{\theta} U_{\theta}(x)\big]$ is easy to estimate by sampling a batch of real data. However, $\mathbb{E}_{x\sim q_{\theta}(x)}\big[\nabla_{\theta} U_{\theta}(x)\big]$ is difficult because we don't know how to sample from $q_{\theta}(x)$.
The strategy in the second post was to define another easily-sampled distribution $q_{\varphi}(x)$, sample from it, and minimize the difference between $q_{\varphi}(x)$ and $q_{\theta}(x)$. This paper, however, takes a different approach: it samples directly from the Langevin equation corresponding to the energy model.
The logic is simple and was mentioned in the previous article. For the Langevin equation:
\begin{equation}x_{t+1} = x_t - \frac{1}{2}\varepsilon \nabla_x U(x_t) + \sqrt{\varepsilon}\alpha,\quad \alpha \sim \mathcal{N}(\alpha;0,1)\label{eq:sde}\end{equation}
As $\varepsilon \to 0$ and $t \to \infty$, the distribution of the sequence $\{x_t\}$ becomes $q_{\theta}(x)$. In other words, $q_{\theta}(x)$ is the stationary distribution of this Langevin equation. Put simply, once $U_{\theta}(x)$ is given (and thus $q_{\theta}(x)$ is determined), the recursive process in Eq. $\eqref{eq:sde}$ allows us to obtain samples from $q_{\theta}(x)$.
With this sampling process, everything is set. Now $\mathbb{E}_{x\sim q_{\theta}(x)}\big[\nabla_{\theta} U_{\theta}(x)\big]$ can be estimated, allowing the energy model to be trained. Once training is complete, the same Eq. $\eqref{eq:sde}$ generates new samples, thus completing the generative process.
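Concretely, the sampler is just a loop over Eq. $\eqref{eq:sde}$, with $\nabla_x U_{\theta}(x)$ obtained by automatic differentiation. A minimal PyTorch sketch (my own; the `energy_net` module, step size, and step count are illustrative placeholders rather than the paper's exact settings):

```python
import torch

def langevin_sample(energy_net, x0, eps=1e-2, k=50):
    """Iterate the Langevin update for k steps starting from x0."""
    x = x0.clone().detach().requires_grad_(True)
    for _ in range(k):
        grad_x, = torch.autograd.grad(energy_net(x).sum(), x)  # nabla_x U_theta(x_t)
        x = x - 0.5 * eps * grad_x + eps ** 0.5 * torch.randn_like(x)
        x = x.detach().requires_grad_(True)  # cut the graph between iterations
    return x.detach()
```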
Model Details
While the theory is sound, the practical implementation involves many details and significant "alchemy." I had considered this path earlier but thought there were too many edge-case problems to solve. I admire the authors for persevering and making it work.
First, the authors added Spectral Normalization to the model $U_{\theta}(x)$. Since $U_{\theta}(x)$ essentially plays the role of the discriminator in a GAN, adding Spectral Normalization makes sense. Second, during training, the energy function used is not just $U_{\theta}(x)$, but $U_{\theta}(x) + \lambda U_{\theta}^2(x)$, where $\lambda$ is a small positive constant. The authors suggest this makes the loss smoother and the training more stable (at inference, only $U_{\theta}(x)$ is used).
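For concreteness, here is a toy sketch of what an energy network wrapped in Spectral Normalization might look like in PyTorch (my own placeholder architecture, not the one used in the paper):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def make_energy_net(in_channels=3):
    """Toy CNN mapping an image batch to one scalar energy per sample;
    every learnable layer is wrapped in spectral normalization."""
    return nn.Sequential(
        spectral_norm(nn.Conv2d(in_channels, 64, 3, stride=2, padding=1)),
        nn.LeakyReLU(0.2),
        spectral_norm(nn.Conv2d(64, 128, 3, stride=2, padding=1)),
        nn.LeakyReLU(0.2),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        spectral_norm(nn.Linear(128, 1)),  # U_theta(x): shape (batch, 1)
    )
```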
Regarding sampling with Eq. $\eqref{eq:sde}$: it is an iterative process and therefore needs an initial value. If the initial vector is drawn from a simple random distribution (such as a uniform distribution) every time, the authors found that something like "mode collapse" occurs: the generated images end up looking very similar, because the sampling is not exploratory enough. To solve this, the authors maintain a Buffer that stores historical sampling results and serves as a pool of initial values for subsequent sampling.
Broadly, the model update process is as follows (a code sketch is given after the list):
Assume the data distribution is $p(x)$. Set the iteration step size $\varepsilon$ (reference value $1/200$), the number of iteration steps $K$ (reference value $20\sim 50$), and the batch size $N$. Initialize the Buffer $\mathcal{B}$ as an empty set.
Loop until convergence:
Loop to obtain a batch of real and fake samples:
- Sample a real data point $x_r$ from $p(x)$ and add it to the current batch.
- With a 95% probability, pick a sample from $\mathcal{B}$ as the initial value $x_{f,0}$; otherwise (5% probability), use a uniform distribution.
- Starting from $x_{f,0}$, iterate Eq. $\eqref{eq:sde}$ for $K$ steps to obtain $x_{f,K}$.
- Treat $x_{f,K}$ as the fake sample $x_f$, add it to the batch, and update $\mathcal{B}$ with it.
With the real and fake samples, perform one optimizer step with the objective:
\begin{equation}\frac{1}{N}\sum\limits_{x_r, x_f} \Big\{U_{\theta}(x_r) - U_{\theta}(x_f) + \lambda \big[U_{\theta}^2(x_r) - U_{\theta}^2(x_f)\big]\Big\}\end{equation}
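Putting the pieces together, here is a rough sketch of a single update step following the recipe above (my own code, not the authors' release; it reuses the hypothetical `langevin_sample` and energy network from the earlier sketches, draws the whole batch of initial values at once rather than deciding per sample, and the value of $\lambda$ is an arbitrary placeholder):

```python
import random
import torch

def train_step(energy_net, opt, x_real, buffer, lam=0.05,
               eps=1/200, k=50, reinit_prob=0.05):
    """One optimizer step on the objective above, using the replay Buffer."""
    n = x_real.size(0)
    if len(buffer) >= n and random.random() > reinit_prob:
        # 95% of the time: restart the chains from past samples in the Buffer.
        x0 = torch.stack(random.sample(buffer, n)).to(x_real.device)
    else:
        # Otherwise: uniform re-initialization (images assumed scaled to [0, 1]).
        x0 = torch.rand_like(x_real)
    x_fake = langevin_sample(energy_net, x0, eps=eps, k=k)
    buffer.extend(x_fake.cpu().unbind(0))  # a real implementation would cap the Buffer size

    u_real, u_fake = energy_net(x_real), energy_net(x_fake)
    loss = (u_real - u_fake + lam * (u_real ** 2 - u_fake ** 2)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```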
Post-training sampling also requires maintaining a Buffer. To ensure diversity, the authors trained several instances of the model (ending up with different weights) and sampled from them simultaneously, all sharing and collectively maintaining a single Buffer. For other details, please refer to the original paper; since I don't intend to replicate it, I won't delve deeper here.
Author's Implementation: https://github.com/openai/ebm_code_release
Personal Summary
Overall, I believe this is a respectable and solid paper. The logic and theory are mature, since the relationship between energy models and Langevin equations was established long ago. However, overcoming the many small technical hurdles and actually getting the idea to work is no small feat, and it showcases the authors' deep expertise (and "alchemy" skills) in generative modeling. From the energy-model perspective, the paper provides a viable scheme for training complex energy-based models.
As for results, one could say it rivals GANs, but one could also argue it falls short. The authors mostly experimented on CIFAR-10 and ImageNet—both notoriously difficult datasets. Looking at the results, the model can indeed go head-to-head with many GANs; it clearly outperforms Glow on CIFAR-10. On the flip side, it feels highly heuristic and not particularly elegant—for instance, the sampling via Langevin dynamics feels somewhat precarious, and the Buffer approach, while effective, feels very engineering-heavy.
Unconditional generation results on CIFAR-10.