By 苏剑林 | August 8, 2022
In the previous article "Conversations on Generative Diffusion Models (5): SDE Perspective of the General Framework", we provided a basic introduction and derivation of Yang Song's paper "Score-Based Generative Modeling through Stochastic Differential Equations". However, as the name suggests, the previous article primarily covered the SDE-related parts of the original paper, leaving out the section referred to as "Probability flow ODE." This article is a supplementary post on that topic.
In fact, although this leftover content occupies only a small section in the main text of the original paper, it warrants a dedicated article. After much deliberation, I found that the derivation of this result cannot bypass the Fokker-Planck equation, so we will need to spend some space introducing the Fokker-Planck equation before the main protagonist, the ODE, takes the stage.
Let's roughly summarize the content of the previous article: First, we defined a forward process ("tearing down the building") through an SDE:
\begin{equation}d\boldsymbol{x} = \boldsymbol{f}_t(\boldsymbol{x}) dt + g_t d\boldsymbol{w}\label{eq:sde-forward}\end{equation}Then, we derived the corresponding reverse process SDE ("constructing the building"):
\begin{equation}d\boldsymbol{x} = \left[\boldsymbol{f}_t(\boldsymbol{x}) - g_t^2\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}) \right] dt + g_t d\boldsymbol{w}\label{eq:sde-reverse}\end{equation}Finally, we derived the loss function (score matching) for estimating $\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})$ using a neural network $\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}, t)$:
\begin{equation}\mathbb{E}_{\boldsymbol{x}_0,\boldsymbol{x}_t \sim p(\boldsymbol{x}_t|\boldsymbol{x}_0)\tilde{p}(\boldsymbol{x}_0)}\left[\left\Vert \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2\right] \end{equation}At this point, we have completed a general framework for the training and prediction of diffusion models. It can be said to be a very generalized extension of DDPM. However, just as DDIM was a result of a "high-level rethink" of DDPM introduced in "Conversations on Generative Diffusion Models (4): DDIM = DDPM from a High-Level Perspective", does SDE, as an extension of DDPM, have a corresponding "high-level rethink"? Yes, the result is the subject of this article: the "Probability Flow ODE."
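Before moving on, it may help to make this training loss concrete. Below is a minimal NumPy sketch, assuming the Gaussian transition $p(\boldsymbol{x}_t|\boldsymbol{x}_0) = \mathcal{N}(\bar{\alpha}_t \boldsymbol{x}_0, \bar{\beta}_t^2 \boldsymbol{I})$ used throughout this series, so that $\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t|\boldsymbol{x}_0) = -\boldsymbol{\varepsilon}/\bar{\beta}_t$; the callable `score_model` is a hypothetical stand-in for the network $\boldsymbol{s}_{\boldsymbol{\theta}}$:

```python
import numpy as np

def dsm_loss(score_model, x0, alpha_bar, beta_bar, t, rng):
    """Monte Carlo estimate of the score-matching loss above at one time t.

    Assumes the Gaussian transition p(x_t|x_0) = N(alpha_bar * x_0, beta_bar^2 * I),
    whose conditional score is -(x_t - alpha_bar * x_0) / beta_bar^2 = -eps / beta_bar.
    x0 has shape (batch, dim); alpha_bar, beta_bar are the scalar coefficients at t.
    `score_model(x, t)` is a hypothetical stand-in for s_theta(x, t).
    """
    eps = rng.standard_normal(x0.shape)         # eps ~ N(0, I)
    xt = alpha_bar * x0 + beta_bar * eps        # sample x_t ~ p(x_t|x_0)
    target = -eps / beta_bar                    # grad_{x_t} log p(x_t|x_0)
    diff = score_model(xt, t) - target
    return np.mean(np.sum(diff ** 2, axis=-1))  # E[ ||s_theta - target||^2 ]
```

In an actual implementation, $t$ would be sampled per example and `score_model` would be a trainable network optimized on this quantity; the sketch only shows how the regression target is formed.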
What kind of rethinking did DDIM perform? Put simply, DDIM observed that the training objective of DDPM depends essentially only on $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$ and not on $p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})$. It therefore took $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$ as the starting point and derived more general forms of $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t,\boldsymbol{x}_0)$ and $p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1},\boldsymbol{x}_0)$. The rethinking behind the Probability Flow ODE is similar: within the SDE framework, holding the marginals $p(\boldsymbol{x}_t)$ fixed, what other transition kernels $p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_t)$ (i.e., other forward SDEs) can be found?
We first write the discrete form of the forward process $\eqref{eq:sde-forward}$:
\begin{equation}\boldsymbol{x}_{t+\Delta t} = \boldsymbol{x}_t + \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t + g_t \sqrt{\Delta t}\boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\label{eq:sde-discrete}\end{equation}This equation describes the relationship between the random variables $\boldsymbol{x}_{t+\Delta t}, \boldsymbol{x}_t, \boldsymbol{\varepsilon}$. Taking the expectation of both sides would be easy, but what we want is not an expectation; it is the relation satisfied by the marginal distribution $p_t(\boldsymbol{x})$. How do we express a distribution in expectation form? The answer is the Dirac delta function:
\begin{equation}p(\boldsymbol{x}) = \int \delta(\boldsymbol{x} - \boldsymbol{y}) p(\boldsymbol{y}) d\boldsymbol{y} = \mathbb{E}_{\boldsymbol{y}}[\delta(\boldsymbol{x} - \boldsymbol{y})]\end{equation}A fully rigorous definition of the Dirac delta belongs to functional analysis, but treating it as an ordinary function, as we do below, generally leads to correct results. From the above identity we also see that, for any $f(\boldsymbol{x})$, the following holds:
\begin{equation}p(\boldsymbol{x})f(\boldsymbol{x}) = \int \delta(\boldsymbol{x} - \boldsymbol{y}) p(\boldsymbol{y})f(\boldsymbol{y}) d\boldsymbol{y} = \mathbb{E}_{\boldsymbol{y}}[\delta(\boldsymbol{x} - \boldsymbol{y}) f(\boldsymbol{y})]\end{equation}Taking the partial derivative of both sides with respect to $\boldsymbol{x}$, we get:
\begin{equation}\nabla_{\boldsymbol{x}}[p(\boldsymbol{x}) f(\boldsymbol{x})] = \mathbb{E}_{\boldsymbol{y}}\left[\nabla_{\boldsymbol{x}}\delta(\boldsymbol{x} - \boldsymbol{y}) f(\boldsymbol{y})\right] = \mathbb{E}_{\boldsymbol{y}}\left[f(\boldsymbol{y})\nabla_{\boldsymbol{x}}\delta(\boldsymbol{x} - \boldsymbol{y})\right]\end{equation}This is a property we will use shortly: in essence, a derivative acting on the Dirac delta can be transferred, through the integration, onto the function multiplying it.
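To see why, it suffices to check the one-dimensional case, where this is just integration by parts: using $\frac{d}{dx}\delta(x - y) = -\frac{d}{dy}\delta(x - y)$ and the vanishing of boundary terms,
\begin{equation}\mathbb{E}_{y}\left[f(y)\frac{d}{dx}\delta(x - y)\right] = -\int f(y)p(y)\frac{d}{dy}\delta(x - y)\, dy = \int \delta(x - y)\frac{d}{dy}\left[f(y)p(y)\right] dy = \frac{d}{dx}\left[p(x)f(x)\right]\end{equation}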
With this groundwork in place, we now use Equation $\eqref{eq:sde-discrete}$ to write:
\begin{equation}\begin{aligned} &\,\delta(\boldsymbol{x} - \boldsymbol{x}_{t+\Delta t}) \\[5pt] =&\, \delta(\boldsymbol{x} - \boldsymbol{x}_t - \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t - g_t \sqrt{\Delta t}\boldsymbol{\varepsilon}) \\ \approx&\, \delta(\boldsymbol{x} - \boldsymbol{x}_t) - \left(\boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t + g_t \sqrt{\Delta t}\boldsymbol{\varepsilon}\right)\cdot \nabla_{\boldsymbol{x}}\delta(\boldsymbol{x} - \boldsymbol{x}_t) + \frac{1}{2} \left(g_t\sqrt{\Delta t}\boldsymbol{\varepsilon}\cdot \nabla_{\boldsymbol{x}}\right)^2\delta(\boldsymbol{x} - \boldsymbol{x}_t) \end{aligned}\end{equation}Here we treat $\delta(\cdot)$ as an ordinary function and perform a Taylor expansion, keeping only terms up to $\mathcal{O}(\Delta t)$. Now we take the expectation of both sides:
\begin{equation}\begin{aligned} &\,p_{t+\Delta t}(\boldsymbol{x}) \\[6pt] =&\, \mathbb{E}_{\boldsymbol{x}_{t+\Delta t}}\left[\delta(\boldsymbol{x} - \boldsymbol{x}_{t+\Delta t})\right] \\ \approx&\, \mathbb{E}_{\boldsymbol{x}_t, \boldsymbol{\varepsilon}}\left[\delta(\boldsymbol{x} - \boldsymbol{x}_t) - \left(\boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t + g_t \sqrt{\Delta t}\boldsymbol{\varepsilon}\right)\cdot \nabla_{\boldsymbol{x}}\delta(\boldsymbol{x} - \boldsymbol{x}_t) + \frac{1}{2} \left(g_t\sqrt{\Delta t}\boldsymbol{\varepsilon}\cdot \nabla_{\boldsymbol{x}}\right)^2\delta(\boldsymbol{x} - \boldsymbol{x}_t)\right] \\ =&\, \mathbb{E}_{\boldsymbol{x}_t}\left[\delta(\boldsymbol{x} - \boldsymbol{x}_t) - \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t\cdot \nabla_{\boldsymbol{x}}\delta(\boldsymbol{x} - \boldsymbol{x}_t) + \frac{1}{2} g_t^2\Delta t \nabla_{\boldsymbol{x}}\cdot \nabla_{\boldsymbol{x}}\delta(\boldsymbol{x} - \boldsymbol{x}_t)\right] \\ =&\,p_t(\boldsymbol{x}) - \nabla_{\boldsymbol{x}}\cdot\left[\boldsymbol{f}_t(\boldsymbol{x})\Delta t\, p_t(\boldsymbol{x})\right] + \frac{1}{2}g_t^2\Delta t \nabla_{\boldsymbol{x}}\cdot\nabla_{\boldsymbol{x}}p_t(\boldsymbol{x}) \end{aligned}\end{equation}Here the expectation over $\boldsymbol{\varepsilon}$ uses $\mathbb{E}[\boldsymbol{\varepsilon}]=\boldsymbol{0}$ and $\mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^{\top}]=\boldsymbol{I}$, and the last step uses the delta-function property derived above. Subtracting $p_t(\boldsymbol{x})$ from both sides, dividing by $\Delta t$, and taking the limit $\Delta t\to 0$, we obtain:
\begin{equation}\frac{\partial}{\partial t} p_t(\boldsymbol{x}) = - \nabla_{\boldsymbol{x}}\cdot\left[\boldsymbol{f}_t(\boldsymbol{x}) p_t(\boldsymbol{x})\right] + \frac{1}{2}g_t^2 \nabla_{\boldsymbol{x}}\cdot\nabla_{\boldsymbol{x}}p_t(\boldsymbol{x})\label{eq:fp} \end{equation}This is the Fokker-Planck equation (F-P equation) corresponding to Equation $\eqref{eq:sde-forward}$: a partial differential equation describing the evolution of the marginal distribution $p_t(\boldsymbol{x})$.
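As a sanity check (not part of the original derivation), we can verify the F-P equation numerically on the 1D Ornstein-Uhlenbeck process $dx = -x\,dt + dw$, an illustrative choice of $\boldsymbol{f}_t$ and $g_t$. Multiplying the F-P equation by $x$ and $x^2$ and integrating gives the moment equations $\frac{dm}{dt} = -m$ and $\frac{dv}{dt} = -2v + 1$, so from a point mass at $x_0 = 2$ we expect $m_t = 2e^{-t}$ and $v_t = \frac{1}{2}(1 - e^{-2t})$. Simulating the discretized process $\eqref{eq:sde-discrete}$ confirms this:

```python
import numpy as np

# Sanity check of the F-P equation on the OU process dx = -x dt + dw:
# its moment equations dm/dt = -m, dv/dt = -2v + 1 predict, from x_0 = 2,
# m_t = 2 e^{-t} and v_t = (1 - e^{-2t}) / 2.
rng = np.random.default_rng(0)
n_paths, n_steps, T = 200_000, 1_000, 1.0
dt = T / n_steps

x = np.full(n_paths, 2.0)                     # all paths start at x_0 = 2
for _ in range(n_steps):                      # the discretized forward process
    x += -x * dt + np.sqrt(dt) * rng.standard_normal(n_paths)

print(x.mean(), 2 * np.exp(-T))               # both ~ 0.7358
print(x.var(), (1 - np.exp(-2 * T)) / 2)      # both ~ 0.4323
```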
Don't be alarmed by the partial differential equation, as we don't intend to study how to solve it here; we are just using it to guide an equivalent transformation. For any function $\sigma_t$ satisfying $\sigma_t^2\leq g_t^2$, the F-P equation $\eqref{eq:fp}$ is exactly equivalent to:
\begin{equation}\begin{aligned} \frac{\partial}{\partial t} p_t(\boldsymbol{x}) =&\, - \nabla_{\boldsymbol{x}}\cdot\left[\boldsymbol{f}_t(\boldsymbol{x})p_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 - \sigma_t^2)\nabla_{\boldsymbol{x}}p_t(\boldsymbol{x})\right] + \frac{1}{2}\sigma_t^2 \nabla_{\boldsymbol{x}}\cdot\nabla_{\boldsymbol{x}}p_t(\boldsymbol{x}) \\ =&\, - \nabla_{\boldsymbol{x}}\cdot\left[\left(\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 - \sigma_t^2)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right)p_t(\boldsymbol{x})\right] + \frac{1}{2}\sigma_t^2 \nabla_{\boldsymbol{x}}\cdot\nabla_{\boldsymbol{x}}p_t(\boldsymbol{x}) \end{aligned}\label{eq:fp-2}\end{equation}Formally, this F-P equation is equivalent to the original F-P equation with $\boldsymbol{f}_t(\boldsymbol{x})$ replaced by $\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 - \sigma_t^2)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})$ and $g_t$ replaced by $\sigma_t$. Since Equation $\eqref{eq:fp}$ corresponds to Equation $\eqref{eq:sde-forward}$, the above equation corresponds to:
\begin{equation}d\boldsymbol{x} = \left(\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 - \sigma_t^2)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right) dt + \sigma_t d\boldsymbol{w}\label{eq:sde-forward-2}\end{equation}But do not forget that Equation $\eqref{eq:fp}$ and Equation $\eqref{eq:fp-2}$ are completely equivalent. This means that the forward SDEs $\eqref{eq:sde-forward}$ and $\eqref{eq:sde-forward-2}$ share exactly the same marginal distributions $p_t(\boldsymbol{x})$! In other words, there exist forward processes with different variances that all generate the same marginals. This is effectively an upgraded version of DDIM; we will also prove later that when $\boldsymbol{f}_t(\boldsymbol{x})$ is a linear function of $\boldsymbol{x}$, it reduces exactly to DDIM.
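This claim is easy to test numerically in a case where the score is known in closed form. In the sketch below (an illustrative setup of our own, not from the paper), we take $\boldsymbol{f}_t(\boldsymbol{x}) = -\boldsymbol{x}$, $g_t = 1$, and $x_0 \sim \mathcal{N}(0, 1)$, so that $p_t = \mathcal{N}(0, v_t)$ with $v_t = e^{-2t} + \frac{1}{2}(1 - e^{-2t})$ and $\nabla_x \log p_t(x) = -x/v_t$; simulating Equation $\eqref{eq:sde-forward-2}$ with $\sigma_t = \frac{1}{2}$ reproduces the same marginal variance as the original SDE:

```python
import numpy as np

# Two forward SDEs with different diffusion coefficients but identical marginals:
#   (A) dx = -x dt + dw                                          (g_t = 1)
#   (B) dx = [-x - 0.5*(1 - sigma^2) * score(x, t)] dt + sigma dw
# Starting from x_0 ~ N(0, 1), the exact marginal is N(0, v_t) with
# v_t = exp(-2t) + (1 - exp(-2t))/2, so score(x, t) = -x / v_t.
rng = np.random.default_rng(1)
n, n_steps, T, sigma = 200_000, 2_000, 1.0, 0.5
dt = T / n_steps
v = lambda t: np.exp(-2 * t) + (1 - np.exp(-2 * t)) / 2

xa = rng.standard_normal(n)                  # paths of SDE (A)
xb = xa.copy()                               # paths of SDE (B)
for i in range(n_steps):
    t = i * dt
    xa += -xa * dt + np.sqrt(dt) * rng.standard_normal(n)
    drift = -xb - 0.5 * (1 - sigma**2) * (-xb / v(t))
    xb += drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)

print(xa.var(), xb.var(), v(T))              # all three ~ 0.5677
```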
In particular, substituting the new drift $\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 - \sigma_t^2)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})$ and diffusion coefficient $\sigma_t$ into the reverse SDE $\eqref{eq:sde-reverse}$ from the previous article, we can write the reverse SDE corresponding to Equation $\eqref{eq:sde-forward-2}$:
\begin{equation}d\boldsymbol{x} = \left(\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 + \sigma_t^2)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right) dt + \sigma_t d\boldsymbol{w}\label{eq:sde-reverse-2}\end{equation}Equations $\eqref{eq:sde-forward-2}$ and $\eqref{eq:sde-reverse-2}$ allow us to change the variance of the sampling process. Here we consider the extreme case $\sigma_t = 0$, where the SDE degenerates into an ODE (Ordinary Differential Equation):
\begin{equation}d\boldsymbol{x} = \left(\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}g_t^2\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right) dt\label{eq:flow-ode}\end{equation}This ODE is called the "Probability flow ODE." Since in practice $\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})$ needs to be approximated by a neural network $\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}, t)$, the above equation also corresponds to a "Neural ODE."
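To make this family concrete, here is a minimal sketch of one Euler step of the reverse SDE $\eqref{eq:sde-reverse-2}$, where `score_model`, `f`, `g`, and `sigma` are assumed callables; setting `sigma` to zero turns it into a deterministic step of the probability flow ODE $\eqref{eq:flow-ode}$:

```python
import numpy as np

def reverse_step(x, t, dt, score_model, f, g, sigma, rng):
    """One Euler step of the reverse SDE, integrating from t to t - dt.

    `score_model(x, t)` stands in for the trained network s_theta(x, t);
    `f`, `g`, `sigma` are the drift, forward diffusion, and chosen reverse
    diffusion coefficients. With sigma(t) == g(t) this recovers the original
    reverse SDE; with sigma(t) == 0 it is a deterministic step of the
    probability flow ODE.
    """
    drift = f(x, t) - 0.5 * (g(t) ** 2 + sigma(t) ** 2) * score_model(x, t)
    noise = sigma(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x - drift * dt + noise            # time runs backward, hence -dt
```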
Why single out the zero-variance case? Because the process then carries no noise: the mapping from $\boldsymbol{x}_0$ to $\boldsymbol{x}_T$ is a deterministic transformation. Solving the ODE in reverse therefore yields the inverse transformation from $\boldsymbol{x}_T$ to $\boldsymbol{x}_0$, which is likewise deterministic (substituting $\sigma_t=0$ into Equation $\eqref{eq:sde-reverse-2}$ also shows that the forward and reverse equations coincide). This is exactly the setting of flow models, which transform noise into samples through an invertible transformation. The probability flow ODE thus lets us transfer results from flow models to diffusion models, such as the exact likelihood computation and latent-variable representations mentioned in the original paper, which are essentially benefits of flow models. The invertibility also allows us to perform image editing in the latent space.
On the other hand, since the transformation from $\boldsymbol{x}_T$ to $\boldsymbol{x}_0$ is now described by an ODE, we can use various high-order numerical ODE solvers to accelerate it. In principle there are acceleration methods for solving SDEs as well, but SDE acceleration is far harder and far less well studied than ODE acceleration. Overall, compared with SDEs, ODEs are much simpler and more direct both to analyze theoretically and to solve in practice.
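For instance, a generic off-the-shelf adaptive solver already suffices. Below is a minimal sketch using SciPy, where `score_model` is again a hypothetical trained network, and stopping slightly above $t = 0$ is our own practical safeguard (the score can blow up there), not something prescribed by the paper:

```python
import numpy as np
from scipy.integrate import solve_ivp

def sample_probability_flow(score_model, x_T, f, g, T=1.0, t_eps=1e-3):
    """Draw a sample by integrating the probability flow ODE from t = T down to t_eps.

    dx/dt = f(x, t) - 0.5 * g(t)^2 * score_model(x, t)
    An adaptive high-order solver (RK45 here) typically needs far fewer
    score evaluations than step-by-step SDE sampling.
    """
    shape = x_T.shape

    def ode_fn(t, x_flat):
        x = x_flat.reshape(shape)
        dx = f(x, t) - 0.5 * g(t) ** 2 * score_model(x, t)
        return np.ravel(dx)

    sol = solve_ivp(ode_fn, (T, t_eps), np.ravel(x_T),
                    method="RK45", rtol=1e-5, atol=1e-5)
    return sol.y[:, -1].reshape(shape)       # the approximate x_0
```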
At the end of "Conversations on Generative Diffusion Models (4): DDIM = DDPM from a High-Level Perspective", we derived that the continuous version of DDIM corresponds to the ODE:
\begin{equation}\frac{d}{ds}\left(\frac{\boldsymbol{x}(s)}{\bar{\alpha}(s)}\right) = \boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\boldsymbol{x}(s), t(s)\right)\frac{d}{ds}\left(\frac{\bar{\beta}(s)}{\bar{\alpha}(s)}\right)\label{eq:ddim-ode}\end{equation}Next, we can see that this result is actually a special case of Equation $\eqref{eq:flow-ode}$ in this article when $\boldsymbol{f}_t(\boldsymbol{x})$ takes the linear function $f_t \boldsymbol{x}$. At the end of "Conversations on Generative Diffusion Models (5): SDE Perspective of the General Framework", we derived the corresponding relationships:
\begin{equation}\left\{\begin{aligned} &f_t = \frac{1}{\bar{\alpha}_t}\frac{d\bar{\alpha}_t}{dt} \\ &g_t^2 = 2\bar{\alpha}_t \bar{\beta}_t \frac{d}{dt}\left(\frac{\bar{\beta}_t}{\bar{\alpha}_t}\right) \\ &\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}, t) = -\frac{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}, t)}{\bar{\beta}_t} \end{aligned}\right.\end{equation}Substituting these relationships into Equation $\eqref{eq:flow-ode}$ [where $\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})$ is replaced by $\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}, t)$] and simplifying, we get:
\begin{equation}\frac{1}{\bar{\alpha}_t}\frac{d\boldsymbol{x}}{dt} - \frac{\boldsymbol{x}}{\bar{\alpha}_t^2}\frac{d\bar{\alpha}_t}{dt} = \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}, t)\frac{d}{dt}\left(\frac{\bar{\beta}_t}{\bar{\alpha}_t}\right)\end{equation}The left side is just $\frac{d}{dt}\left(\frac{\boldsymbol{x}}{\bar{\alpha}_t}\right)$, so the above equation is completely equivalent to Equation $\eqref{eq:ddim-ode}$ (up to the change of time variable from $t$ to $s$).
Building on the SDE article, this article used the F-P equation to derive a more general family of forward equations sharing the same marginal distributions, obtained the "Probability Flow ODE" as the zero-variance special case, and proved that DDIM is a special case of it.