By 苏剑林 | June 28, 2023
In the previous article "Generative Diffusion Model Ramblings (19): GAN as Diffusion ODE", we introduced how to understand GAN as a diffusion ODE in another time dimension. In short, GAN essentially transforms the movement of samples in a diffusion model into the movement of generator parameters! However, the derivation in that article relied on relatively complex, standalone material such as Wasserstein gradient flow, making it hard to connect with the earlier articles in this diffusion series and leaving a technical "gap." In my view, the ReFlow introduced in "Generative Diffusion Model Ramblings (17): General Steps for Constructing ODE (Part 2)" is the most intuitive approach for understanding Diffusion ODEs. Since GAN can be understood from a Diffusion ODE perspective, there should also be a way to understand GAN via ReFlow. After some experimentation, I have successfully derived results similar to WGAN-GP from ReFlow.
The reason I say "ReFlow is the most intuitive approach for understanding Diffusion ODEs" is that it is highly flexible and stays close to experimental practice: it establishes a mapping from any noise distribution to a target data distribution through an ODE, and its training objective is straightforward, translating directly into experimental code with no detours. Specifically, assume $\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0)$ is random noise sampled from a prior distribution, and $\boldsymbol{x}_1\sim p_1(\boldsymbol{x}_1)$ is a real sample drawn from the target distribution (Note: in previous articles, $\boldsymbol{x}_T$ was generally the noise and $\boldsymbol{x}_0$ the target sample; here they are reversed for convenience). ReFlow allows us to specify any trajectory from $\boldsymbol{x}_0$ to $\boldsymbol{x}_1$; for simplicity, it chooses a straight line:
\begin{equation}\boldsymbol{x}_t = (1-t)\boldsymbol{x}_0 + t \boldsymbol{x}_1\label{eq:line}\end{equation}Now we find the ODE it satisfies:
\begin{equation}\frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{x}_1 - \boldsymbol{x}_0\end{equation}This ODE is very simple, but it is not practical because we want to generate $\boldsymbol{x}_1$ from $\boldsymbol{x}_0$ via the ODE, yet the target we want to generate is placed inside the equation itself—a "reversal of cause and effect." To remedy this defect, ReFlow's idea is simple: learn a function of $\boldsymbol{x}_t$ to approximate $\boldsymbol{x}_1 - \boldsymbol{x}_0$, and then use it to replace $\boldsymbol{x}_1 - \boldsymbol{x}_0$:
\begin{equation}\boldsymbol{\varphi}^* = \mathop{\text{argmin}}_{\boldsymbol{\varphi}} \mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0),\boldsymbol{x}_1\sim p_1(\boldsymbol{x}_1)}\left[\frac{1}{2}\Vert\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t, t) - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\Vert^2\right]\label{eq:s-loss}\end{equation}and
\begin{equation}\frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{x}_1 - \boldsymbol{x}_0\quad\Rightarrow\quad\frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{v}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t, t)\label{eq:ode-core}\end{equation}We have previously proved that under the assumption that $\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t, t)$ has infinite fitting capability, the new ODE can indeed achieve sample transformation from the distribution $p_0(\boldsymbol{x}_0)$ to the distribution $p_1(\boldsymbol{x}_1)$.
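To make this concrete, here is a minimal PyTorch-style sketch of the training objective $\eqref{eq:s-loss}$ (my own illustration, not reference code): `v_model` is a hypothetical velocity network that takes $(\boldsymbol{x}_t, t)$ and returns a tensor shaped like $\boldsymbol{x}_t$, and $t$ is sampled uniformly from $[0,1]$.

```python
import torch

def reflow_loss(v_model, x0, x1):
    """Velocity-matching loss of eq. (s-loss); a minimal sketch, not reference code.

    v_model : hypothetical network, v_model(x_t, t) -> tensor shaped like x_t
    x0      : batch of prior samples, shape (B, ...)
    x1      : batch of real samples,  shape (B, ...)
    """
    # one uniform time per sample, broadcast over the remaining dimensions
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    x_t = (1 - t) * x0 + t * x1   # straight-line trajectory, eq. (line)
    target = x1 - x0              # the exact ODE velocity dx_t/dt
    return 0.5 * ((v_model(x_t, t) - target) ** 2).mean()
```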
One of the important characteristics of ReFlow is that it does not restrict the form of the prior distribution $p_0(\boldsymbol{x}_0)$. This means we can replace the prior distribution with any distribution we want, such as a distribution transformed by a generator $\boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z})$:
\begin{equation}\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0)\quad\Leftrightarrow\quad \boldsymbol{x}_0 = \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}),\,\boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})\end{equation}We substitute this into equation $\eqref{eq:s-loss}$ and train; once training is complete, we can use equation $\eqref{eq:ode-core}$ to transform any $\boldsymbol{x}_0 = \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z})$ into a real sample $\boldsymbol{x}_1$.
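In code, swapping in the generator prior amounts to nothing more than drawing $\boldsymbol{x}_0$ from $\boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z})$ before evaluating the loss above; a sketch, where the generator `g_model` and the latent dimension `z_dim` are hypothetical names:

```python
def reflow_loss_with_generator(v_model, g_model, x1, z_dim):
    """Same objective, but the prior samples x0 come from a (fixed) generator."""
    z = torch.randn(x1.shape[0], z_dim, device=x1.device)
    x0 = g_model(z).detach()  # the generator is held fixed while v is trained
    return reflow_loss(v_model, x0, x1)
```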
However, we are not satisfied with just this. As mentioned earlier, GAN transforms the movement of samples in a diffusion model into the movement of generator parameters. This can also be done within the ReFlow framework: assuming the generator's current parameters are $\boldsymbol{\theta}_{\tau}$, we expect the change $\boldsymbol{\theta}_{\tau}\to \boldsymbol{\theta}_{\tau+1}$ to simulate taking a small step forward according to equation $\eqref{eq:ode-core}$:
\begin{equation}\boldsymbol{\theta}_{\tau+1} = \mathop{\text{argmin}}_{\boldsymbol{\theta}}\mathbb{E}_{\boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})}\Big[\big\Vert \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) - \boldsymbol{g}_{\boldsymbol{\theta}_{\tau}}(\boldsymbol{z}) - \epsilon\,\boldsymbol{v}_{\boldsymbol{\varphi}^*}(\boldsymbol{g}_{\boldsymbol{\theta}_{\tau}}(\boldsymbol{z}), 0)\big\Vert^2\Big]\label{eq:g-loss}\end{equation}Note that $t$ in equations $\eqref{eq:s-loss}$ and $\eqref{eq:ode-core}$ does not have the same meaning as $\tau$ in parameters $\boldsymbol{\theta}_{\tau}$; the former is the time parameter of the ODE, and the latter is the training progress, so different notation is used. Furthermore, $\boldsymbol{g}_{\boldsymbol{\theta}_{\tau}}(\boldsymbol{z})$ appears as $\boldsymbol{x}_0$ for the ODE, so when pushing forward by a small step, we get $\boldsymbol{x}_{\epsilon}$, and the time $t$ substituted into $\boldsymbol{v}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t, t)$ should be $0$.
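A sketch of this single-step objective, continuing the notation above: `g_model` carries the parameters $\boldsymbol{\theta}$ being updated, `g_old` is a frozen copy at $\boldsymbol{\theta}_{\tau}$, and `eps` is the step size $\epsilon$ (all hypothetical names).

```python
def generator_step_loss(g_model, g_old, v_model, z, eps):
    """Single-step objective of eq. (g-loss); a sketch with hypothetical names."""
    with torch.no_grad():
        x0 = g_old(z)  # current generator output, playing the role of x_0 (t = 0)
        t0 = torch.zeros(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
        target = x0 + eps * v_model(x0, t0)  # x_eps: one small step along the ODE
    return ((g_model(z) - target) ** 2).mean()
```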
Now we have the new $\boldsymbol{g}_{\boldsymbol{\theta}_{\tau+1}}(\boldsymbol{z})$. Theoretically, the distribution it produces is closer to the true distribution (because it has moved one small step forward). We then treat it as the new $\boldsymbol{x}_0$, substitute it back into equation $\eqref{eq:s-loss}$ for training, and after training, substitute it into equation $\eqref{eq:g-loss}$ to optimize the generator, and so on. This is an alternating training process similar to GAN.
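Assembled into a training loop, the alternation might look like the following sketch (the optimizers `opt_v`, `opt_g`, the dataloader `real_loader`, and the hyperparameters are illustrative assumptions, not prescriptions):

```python
import copy

def train(v_model, g_model, real_loader, opt_v, opt_g, z_dim, eps, v_steps=5):
    """GAN-like alternating training; all names and hyperparameters are illustrative."""
    for x1 in real_loader:
        # (1) train the velocity field on the current generator prior, eq. (s-loss)
        for _ in range(v_steps):
            opt_v.zero_grad()
            reflow_loss_with_generator(v_model, g_model, x1, z_dim).backward()
            opt_v.step()
        # (2) push the generator one small ODE step forward, eq. (g-loss)
        g_old = copy.deepcopy(g_model)
        z = torch.randn(x1.shape[0], z_dim, device=x1.device)
        opt_g.zero_grad()
        generator_step_loss(g_model, g_old, v_model, z, eps).backward()
        opt_g.step()
```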
Can we quantitatively link this process to existing GANs? Yes! Specifically, to WGAN-GP with gradient penalty.
First, let's look at the loss function $\eqref{eq:s-loss}$. Expanding the expectation part results in:
\begin{equation}\frac{1}{2}\Vert\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t, t)\Vert^2 - \langle\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t, t),\boldsymbol{x}_1 - \boldsymbol{x}_0\rangle + \frac{1}{2}\Vert\boldsymbol{x}_1 - \boldsymbol{x}_0\Vert^2\end{equation}The third term is independent of the parameters $\boldsymbol{\varphi}$, so dropping it does not affect the result. Now assume that $\boldsymbol{v}_{\boldsymbol{\varphi}}$ has strong enough fitting capability that we do not need to explicitly input $t$. Then, as a loss function, the above is equivalent to:
\begin{equation}\frac{1}{2}\Vert\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)\Vert^2 - \langle\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t),\boldsymbol{x}_1 - \boldsymbol{x}_0\rangle = \frac{1}{2}\Vert\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)\Vert^2 - \left\langle\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t),\frac{d\boldsymbol{x}_t}{dt}\right\rangle\end{equation}$\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)$ is a vector function with the same input and output dimensions. We further assume it is the gradient of some scalar function $D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)$, i.e., $\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)=\nabla_{\boldsymbol{x}_t} D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)$. Then the above is:
\begin{equation}\frac{1}{2}\Vert\nabla_{\boldsymbol{x}_t} D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)\Vert^2 - \left\langle\nabla_{\boldsymbol{x}_t} D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t),\frac{d\boldsymbol{x}_t}{dt}\right\rangle = \frac{1}{2}\Vert\nabla_{\boldsymbol{x}_t} D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)\Vert^2 - \frac{d D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)}{dt}\end{equation}Assuming that $D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)$ varies smoothly along the trajectory, $\frac{d D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)}{dt}$ should be close to the finite difference between the endpoints $t=0$ and $t=1$, namely $D_{\boldsymbol{\varphi}}(\boldsymbol{x}_1)-D_{\boldsymbol{\varphi}}(\boldsymbol{x}_0)$. Thus, the loss function above is approximately:
\begin{equation}\frac{1}{2}\Vert\nabla_{\boldsymbol{x}_t} D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)\Vert^2 - D_{\boldsymbol{\varphi}}(\boldsymbol{x}_1) + D_{\boldsymbol{\varphi}}(\boldsymbol{x}_0)\end{equation}Readers familiar with GANs should find this very recognizable—it is exactly the discriminator loss function for WGAN with gradient penalty! Even the construction of $\boldsymbol{x}_t$ for the gradient penalty term via $\eqref{eq:line}$ is identical (linear interpolation between real and fake samples)! The only difference is that the gradient penalty in the original WGAN-GP is centered at 1, whereas here it is centered at zero. In fact, articles such as "WGAN-div: An Obscure Filler of the WGAN Pit" and "Optimization Algorithms from a Dynamical Perspective (4): The Third Stage of GAN" have shown that zero-centered gradient penalties often perform better.
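Under the parameterization $\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)=\nabla_{\boldsymbol{x}_t} D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)$, the approximate loss above can be coded directly as a critic loss with a zero-centered gradient penalty on the interpolates; a sketch, where the scalar-output network `d_model` is hypothetical, $\boldsymbol{x}_0$ is assumed already detached from the generator, and the gradient is obtained with autograd:

```python
def critic_loss(d_model, x0, x1):
    """Approximate form of eq. (s-loss) under v = grad D: a WGAN-style critic loss
    with a zero-centered gradient penalty on the interpolates x_t (a sketch)."""
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    x_t = ((1 - t) * x0 + t * x1).requires_grad_(True)  # eq. (line): real/fake interpolation
    # grad_x D(x_t), kept in the graph so the penalty itself can be backpropagated
    grad = torch.autograd.grad(d_model(x_t).sum(), x_t, create_graph=True)[0]
    penalty = 0.5 * (grad ** 2).flatten(1).sum(dim=1)   # zero-centered ||grad D||^2 / 2
    return (penalty - d_model(x1).flatten() + d_model(x0).flatten()).mean()
```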
Therefore, under specific parameterization and assumptions, the loss function $\eqref{eq:s-loss}$ is equivalent to the WGAN-GP discriminator loss. As for the generator loss, in the previous article "Generative Diffusion Model Ramblings (19): GAN as Diffusion ODE", we already proved that when $\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)=\nabla_{\boldsymbol{x}_t} D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)$, the gradient of the single-step optimization of equation $\eqref{eq:g-loss}$ is equivalent to the gradient of:
\begin{equation}\boldsymbol{\theta}_{\tau+1} = \mathop{\text{argmin}}_{\boldsymbol{\theta}}\mathbb{E}_{\boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})}[-D_{\boldsymbol{\varphi}^*}(\boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}))]\end{equation}which is precisely the WGAN-GP generator loss.
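In code, this equivalent generator objective is just the familiar WGAN form (again a sketch with the hypothetical `d_model` and `g_model` from above):

```python
def wgan_generator_loss(d_model, g_model, z):
    """Generator loss whose gradient matches the single ODE step of eq. (g-loss)."""
    return -d_model(g_model(z)).mean()
```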
In this article, I have attempted to derive the connection between WGAN-GP and Diffusion ODEs starting from ReFlow. This perspective is simpler and more intuitive, and it avoids relatively complex concepts such as Wasserstein gradient flow.