By 苏剑林 | August 03, 2022
When I wrote the first article on generative diffusion models, a reader in the comments recommended Dr. Yang Song's paper "Score-Based Generative Modeling through Stochastic Differential Equations". It can be said that this paper constructs a fairly general theoretical framework for generative diffusion models, linking results from DDPM, SDE, ODE, and more. It is admittedly an excellent paper, but it is not suitable for beginners: it directly draws on stochastic differential equations (SDEs), the Fokker-Planck equation, score matching, and other advanced topics, making the barrier to entry quite high.
However, after accumulating knowledge through the first four articles, we can now attempt to study this paper. In the following sections, I will try to replicate the derivation results of the original paper starting from as few theoretical foundations as possible.
Stochastic Differential Equations
In DDPM, the diffusion process is divided into a fixed number of steps $T$. Using the analogy from "Conversations on Generative Diffusion Models (1): DDPM = Demolition + Construction", both the "demolition" and "construction" of the building were pre-divided into $T$ steps, which is quite arbitrary. In fact, the real "demolition" and "construction" processes have no deliberately divided steps; instead, we can understand them as continuous transformation processes over time, which can be described by a Stochastic Differential Equation (SDE).
To this end, we use the following SDE to describe the forward process ("demolition"):
\begin{equation}d\boldsymbol{x} = \boldsymbol{f}_t(\boldsymbol{x}) dt + g_t d\boldsymbol{w}\label{eq:sde-forward}\end{equation}
I expect many readers are unfamiliar with SDEs; I myself only encountered them briefly during my master's studies and know only the basics. But that is no obstacle here: we only need to view the SDE as the limit of the following discrete form as $\Delta t \to 0$:
\begin{equation}\boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t = \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t + g_t \sqrt{\Delta t}\boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\label{eq:sde-discrete}\end{equation}
To put it more plainly, if we assume demolition takes 1 day, then demolition is the process of change in $\boldsymbol{x}$ from $t=0$ to $t=1$. The change in each small step can be described by the above equation. As for the time interval $\Delta t$, we place no special restrictions on it, except that a smaller $\Delta t$ means a better approximation of the original SDE. If we take $\Delta t=0.001$, it corresponds to the original $T=1000$; if $\Delta t = 0.01$, it corresponds to $T=100$, and so on. In other words, from the perspective of a continuous-time SDE, different $T$ values are merely manifestations of different degrees of discretization of the SDE. They naturally lead to similar results, and we do not need to specify $T$ in advance; instead, we choose an appropriate $T$ for numerical calculation based on the required precision in practical scenarios.
Therefore, the fundamental benefit of introducing the SDE form to describe diffusion models is "separating theoretical analysis from code implementation." We can use the mathematical tools of continuous SDEs to analyze it, and when putting it into practice, we only need to use any appropriate discretization scheme to perform numerical calculations on the SDE.
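As a concrete illustration of "separating theoretical analysis from code implementation," here is a minimal sketch of simulating the forward SDE via the discretization \eqref{eq:sde-discrete} (the Euler-Maruyama scheme). The drift $\boldsymbol{f}_t(\boldsymbol{x}) = -\boldsymbol{x}/2$ and diffusion $g_t = 1$ below are hypothetical choices picked purely for illustration, not taken from the paper:

```python
import numpy as np

def euler_maruyama(x0, f, g, T=1.0, n_steps=1000, rng=None):
    """Simulate dx = f(x, t) dt + g(t) dw by the discretization
    x_{t+dt} = x_t + f(x_t, t) dt + g(t) sqrt(dt) * eps."""
    rng = np.random.default_rng(rng)
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        eps = rng.standard_normal(x.shape)
        x = x + f(x, t) * dt + g(t) * np.sqrt(dt) * eps
    return x

# Hypothetical choices for illustration: f_t(x) = -x/2, g_t = 1.
x1 = euler_maruyama(np.zeros(4), f=lambda x, t: -0.5 * x, g=lambda t: 1.0, rng=0)
```

Doubling `n_steps` (i.e., halving $\Delta t$) leaves the statistics of the result essentially unchanged, which is exactly the sense in which different $T$ are merely different discretizations of the same SDE.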
Regarding equation \eqref{eq:sde-discrete}, readers might wonder why the first term on the right is $\mathcal{O}(\Delta t)$ while the second is $\mathcal{O}(\sqrt{\Delta t})$; that is, why the stochastic term carries the larger weight. This is not actually easy to explain and is one of the confusing aspects of SDEs. Briefly: since $\boldsymbol{\varepsilon}$ always follows a standard normal distribution with mean $\boldsymbol{0}$, successive noise contributions tend to cancel one another. If the stochastic term also carried a weight of $\mathcal{O}(\Delta t)$, then summing the $1/\Delta t$ independent steps covering a fixed time interval would give a total noise variance of $(1/\Delta t)\times\mathcal{O}(\Delta t^2) = \mathcal{O}(\Delta t)\to 0$, so the randomness would vanish in the continuum limit. Only when the weight is amplified to $\mathcal{O}(\sqrt{\Delta t})$ does the accumulated variance stay at $\mathcal{O}(1)$, allowing the stochastic effect to play a role in long-term results.
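This scaling is easy to check numerically: summing the $1/\Delta t$ independent noise terms that cover $t\in[0,1]$ with weight $\sqrt{\Delta t}$ keeps the total variance at $1$, while weight $\Delta t$ makes it vanish as $\Delta t\to 0$. A small sketch (sample sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
for dt in [1e-1, 1e-2, 1e-3]:
    n = int(1 / dt)                              # steps covering t in [0, 1]
    eps = rng.standard_normal((5000, n))
    sum_sqrt = (np.sqrt(dt) * eps).sum(axis=1)   # O(sqrt(dt)) weight: var = n*dt = 1
    sum_lin = (dt * eps).sum(axis=1)             # O(dt) weight: var = n*dt^2 = dt -> 0
    print(dt, sum_sqrt.var(), sum_lin.var())
```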
Reverse Equation
In probabilistic language, equation \eqref{eq:sde-discrete} implies a conditional probability of:
\begin{equation}\begin{aligned}
p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_t) =&\, \mathcal{N}\left(\boldsymbol{x}_{t+\Delta t};\boldsymbol{x}_t + \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t, g_t^2\Delta t \,\boldsymbol{I}\right)\\
\propto&\, \exp\left(-\frac{\Vert\boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t - \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t\Vert^2}{2 g_t^2\Delta t}\right)
\end{aligned}\label{eq:sde-proba}\end{equation}
For the sake of simplicity, the irrelevant normalization factor is not written here. Following the idea of DDPM, we ultimately want to learn "construction" from the process of "demolition," i.e., to obtain $p(\boldsymbol{x}_t|\boldsymbol{x}_{t+\Delta t})$. To do this, just as in "Conversations on Generative Diffusion Models (3): DDPM = Bayesian + Denoising", we use Bayes' theorem:
\begin{equation}\begin{aligned}
p(\boldsymbol{x}_t|\boldsymbol{x}_{t+\Delta t}) =&\, \frac{p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_t)p(\boldsymbol{x}_t)}{p(\boldsymbol{x}_{t+\Delta t})} = p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_t) \exp\left(\log p(\boldsymbol{x}_t) - \log p(\boldsymbol{x}_{t+\Delta t})\right)\\
\propto&\, \exp\left(-\frac{\Vert\boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t - \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t\Vert^2}{2 g_t^2\Delta t} + \log p(\boldsymbol{x}_t) - \log p(\boldsymbol{x}_{t+\Delta t})\right)
\end{aligned}\label{eq:bayes-dt}\end{equation}
It is not difficult to find that when $\Delta t$ is small enough, $p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_t)$ will be significantly non-zero only when $\boldsymbol{x}_{t+\Delta t}$ is sufficiently close to $\boldsymbol{x}_t$. Conversely, only in this case will $p(\boldsymbol{x}_t|\boldsymbol{x}_{t+\Delta t})$ be significantly non-zero. Therefore, we only need to perform an approximation analysis for cases where $\boldsymbol{x}_{t+\Delta t}$ and $\boldsymbol{x}_t$ are sufficiently close. For this, we can use a Taylor expansion:
\begin{equation}\log p(\boldsymbol{x}_{t+\Delta t})\approx \log p(\boldsymbol{x}_t) + (\boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t)\cdot \nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t) + \Delta t \frac{\partial}{\partial t}\log p(\boldsymbol{x}_t)\end{equation}
Note that the $\frac{\partial}{\partial t}$ term should not be ignored, because $p(\boldsymbol{x}_t)$ is actually the "probability density that the random variable at time $t$ equals $\boldsymbol{x}_t$," while $p(\boldsymbol{x}_{t+\Delta t})$ is the "probability density that the random variable at time $t+\Delta t$ equals $\boldsymbol{x}_{t+\Delta t}$." In other words, $p(\boldsymbol{x}_t)$ is actually a function of both $t$ and $\boldsymbol{x}_t$, so an extra partial derivative with respect to $t$ is required. Substituting this into equation \eqref{eq:bayes-dt} and completing the square gives:
\begin{equation}p(\boldsymbol{x}_t|\boldsymbol{x}_{t+\Delta t}) \propto \exp\left(-\frac{\Vert\boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t - \left[\boldsymbol{f}_t(\boldsymbol{x}_t) - g_t^2\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t) \right]\Delta t\Vert^2}{2 g_t^2\Delta t} + \mathcal{O}(\Delta t)\right)\end{equation}
As $\Delta t \to 0$, $\mathcal{O}(\Delta t)\to 0$ becomes irrelevant, therefore:
\begin{equation}\begin{aligned}
p(\boldsymbol{x}_t|\boldsymbol{x}_{t+\Delta t}) \propto&\, \exp\left(-\frac{\Vert\boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t - \left[\boldsymbol{f}_t(\boldsymbol{x}_t) - g_t^2\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t) \right]\Delta t\Vert^2}{2 g_t^2\Delta t}\right) \\
\approx&\,\exp\left(-\frac{\Vert \boldsymbol{x}_t - \boldsymbol{x}_{t+\Delta t} + \left[\boldsymbol{f}_{t+\Delta t}(\boldsymbol{x}_{t+\Delta t}) - g_{t+\Delta t}^2\nabla_{\boldsymbol{x}_{t+\Delta t}}\log p(\boldsymbol{x}_{t+\Delta t}) \right]\Delta t\Vert^2}{2 g_{t+\Delta t}^2\Delta t}\right)
\end{aligned}\end{equation}
That is, $p(\boldsymbol{x}_t|\boldsymbol{x}_{t+\Delta t})$ approximates a normal distribution with mean $\boldsymbol{x}_{t+\Delta t} - \left[\boldsymbol{f}_{t+\Delta t}(\boldsymbol{x}_{t+\Delta t}) - g_{t+\Delta t}^2\nabla_{\boldsymbol{x}_{t+\Delta t}}\log p(\boldsymbol{x}_{t+\Delta t}) \right]\Delta t$ and covariance $g_{t+\Delta t}^2\Delta t\,\boldsymbol{I}$. Taking the limit $\Delta t\to 0$, this corresponds to the SDE:
\begin{equation}d\boldsymbol{x} = \left[\boldsymbol{f}_t(\boldsymbol{x}) - g_t^2\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}) \right] dt + g_t d\boldsymbol{w}\label{eq:reverse-sde}\end{equation}
This is the SDE corresponding to the reverse process, which first appeared in "Reverse-Time Diffusion Equation Models". Here we have specifically added the subscript $t$ to $p$ to emphasize that this is the distribution at time $t$.
Score Matching
Now that we have obtained the reverse SDE as \eqref{eq:reverse-sde}, if we further know $\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})$, we can use the discretization format:
\begin{equation}\boldsymbol{x}_t - \boldsymbol{x}_{t+\Delta t} = - \left[\boldsymbol{f}_{t+\Delta t}(\boldsymbol{x}_{t+\Delta t}) - g_{t+\Delta t}^2\nabla_{\boldsymbol{x}_{t+\Delta t}}\log p(\boldsymbol{x}_{t+\Delta t}) \right]\Delta t - g_{t+\Delta t} \sqrt{\Delta t}\boldsymbol{\varepsilon}\label{eq:reverse-sde-discrete}\end{equation}
to gradually complete the generative process of "construction" [where $\boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$], thereby completing the construction of a generative diffusion model.
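As a sanity check of the discretization \eqref{eq:reverse-sde-discrete}, here is a sketch that runs the reverse process on a toy case where the score is known exactly. Taking $\boldsymbol{f}_t = \boldsymbol{0}$, $g_t = 1$, and data distributed as $\mathcal{N}(0,\sigma_0^2)$, the marginal is $p_t = \mathcal{N}(0, \sigma_0^2 + t)$ with score $-x/(\sigma_0^2+t)$, so running \eqref{eq:reverse-sde-discrete} from the prior at $t=1$ should recover variance $\sigma_0^2$ at $t=0$. All concrete values are illustrative choices, not anything from the paper:

```python
import numpy as np

def reverse_sample(score, g, x_T_std, n_samples=20000, T=1.0, n_steps=500, rng=0):
    """Reverse-SDE discretization with f_t = 0:
    x_t = x_{t+dt} + g^2 * score(x_{t+dt}, t+dt) * dt + g * sqrt(dt) * eps."""
    rng = np.random.default_rng(rng)
    dt = T / n_steps
    x = x_T_std * rng.standard_normal(n_samples)  # start from the prior at t = T
    for i in range(n_steps, 0, -1):
        t = i * dt
        eps = rng.standard_normal(n_samples)
        x = x + g(t) ** 2 * score(x, t) * dt + g(t) * np.sqrt(dt) * eps
    return x

# Toy case: data ~ N(0, sigma0^2), so p_t = N(0, sigma0^2 + t) and the exact
# score is -x / (sigma0^2 + t); reverse sampling should give variance ~ sigma0^2.
sigma0 = 0.5
samples = reverse_sample(score=lambda x, t: -x / (sigma0 ** 2 + t),
                         g=lambda t: 1.0, x_T_std=np.sqrt(sigma0 ** 2 + 1.0))
```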
So how do we obtain $\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})$? The $p_t(\boldsymbol{x})$ at time $t$ is the $p(\boldsymbol{x}_t)$ mentioned earlier, which represents the marginal distribution at time $t$. In practice, we generally design models so that an analytical solution for $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$ can be found, which means:
\begin{equation}\small p(\boldsymbol{x}_t|\boldsymbol{x}_0) = \lim_{\Delta t\to 0}\int\cdots\iint p(\boldsymbol{x}_t|\boldsymbol{x}_{t-\Delta t})p(\boldsymbol{x}_{t-\Delta t}|\boldsymbol{x}_{t-2\Delta t})\cdots p(\boldsymbol{x}_{\Delta t}|\boldsymbol{x}_0) d\boldsymbol{x}_{t-\Delta t} d\boldsymbol{x}_{t-2\Delta t}\cdots d\boldsymbol{x}_{\Delta t}\end{equation}
can be solved directly. For example, when $\boldsymbol{f}_t(\boldsymbol{x})$ is a linear function of $\boldsymbol{x}$, $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$ can be solved analytically. Under this premise, we have:
\begin{equation}p(\boldsymbol{x}_t) = \int p(\boldsymbol{x}_t|\boldsymbol{x}_0)\tilde{p}(\boldsymbol{x}_0)d\boldsymbol{x}_0=\mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]\end{equation}
Thus:
\begin{equation}\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t) = \frac{\mathbb{E}_{\boldsymbol{x}_0}\left[\nabla_{\boldsymbol{x}_t} p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]}{\mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]} = \frac{\mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]}{\mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]}\end{equation}
We can see that the final expression has the form of a "weighted average of $\nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0)$." Since we assumed $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$ has an analytical solution, this expression can be directly estimated. However, it involves an average over all training samples $\boldsymbol{x}_0$, which on one hand requires a large amount of computation, and on the other hand, its generalization capability is insufficient. Therefore, we hope to use a neural network to learn a function $\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)$ so that it can directly calculate $\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t)$.
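The "weighted average" identity is easy to verify numerically. The sketch below uses a hypothetical two-point dataset $x_0\in\{-1,+1\}$ with $p(x_t|x_0)=\mathcal{N}(x_t;x_0,\sigma^2)$ and compares the weighted-average formula against a finite-difference derivative of $\log p(x_t)$:

```python
import numpy as np

# Two-point "dataset" x0 in {-1, +1}, p(x_t|x_0) = N(x_t; x0, sigma^2).
x0s = np.array([-1.0, 1.0])
sigma = 0.7

def p_cond(xt, x0):                  # p(x_t | x_0)
    return np.exp(-(xt - x0) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def p_marg(xt):                      # p(x_t) = E_{x0}[ p(x_t | x_0) ]
    return np.mean([p_cond(xt, x0) for x0 in x0s])

def score_weighted(xt):              # weighted average of conditional scores
    w = np.array([p_cond(xt, x0) for x0 in x0s])
    cond_scores = -(xt - x0s) / sigma ** 2
    return (w * cond_scores).sum() / w.sum()

# Finite-difference derivative of log p(x_t) at an arbitrary test point.
xt, h = 0.3, 1e-5
score_fd = (np.log(p_marg(xt + h)) - np.log(p_marg(xt - h))) / (2 * h)
```

The two values agree to finite-difference precision, confirming that the weighted average of conditional scores is exactly $\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t)$.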
The following result should be familiar to many readers (and is not hard to derive from scratch):
\begin{equation}\mathbb{E}[\boldsymbol{x}] = \mathop{\text{argmin}}_{\boldsymbol{\mu}}\mathbb{E}_{\boldsymbol{x}}\left[\Vert \boldsymbol{\mu} - \boldsymbol{x}\Vert^2\right]\end{equation}
That is, to make $\boldsymbol{\mu}$ equal to the mean of $\boldsymbol{x}$, one only needs to minimize the mean of $\Vert \boldsymbol{\mu} - \boldsymbol{x}\Vert^2$. Similarly, to make $\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)$ equal to the weighted average of $\nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0)$ [i.e., $\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t)$], one only needs to minimize the weighted average of $\left\Vert \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2$, which is:
\begin{equation} \frac{\mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\left\Vert \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2\right]}{\mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]}\end{equation}
The denominator $\mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]$ just serves to adjust the loss weight; for simplicity, we can remove it directly, which will not affect the result of the optimal solution. Finally, we integrate over $\boldsymbol{x}_t$ (equivalent to minimizing the above loss for every $\boldsymbol{x}_t$), obtaining the final loss function:
\begin{equation}\begin{aligned}&\small \,\int \mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\left\Vert \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2\right] d\boldsymbol{x}_t \\
=&\, \mathbb{E}_{\boldsymbol{x}_0,\boldsymbol{x}_t \sim p(\boldsymbol{x}_t|\boldsymbol{x}_0)\tilde{p}(\boldsymbol{x}_0)}\left[\left\Vert \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2\right]
\end{aligned}\label{eq:score-match}\end{equation}
This is the "(conditional) score matching" loss function. The analytical solution for the denoising autoencoder we derived previously in "From Denoising Autoencoders to Generative Models" is also a special case of this. The earliest origin of score matching can be traced back to the 2005 paper "Estimation of Non-Normalized Statistical Models by Score Matching". As for the conditional score matching, the earliest source I traced is the 2011 paper "A Connection Between Score Matching and Denoising Autoencoders".
However, although the result is the same as score matching, in the derivation of this section, we have already set aside the concept of "score." It is purely an answer guided naturally by the objective. I believe such a treatment is more enlightening and hope this derivation can lower the difficulty of understanding score matching for everyone.
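To see that minimizing \eqref{eq:score-match} really does recover the marginal score $\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t)$ rather than any single conditional score, here is a toy check under illustrative assumptions: with data $x_0\sim\mathcal{N}(0,1)$ and $p(x_t|x_0)=\mathcal{N}(\bar{\alpha} x_0,\bar{\beta}^2)$, the marginal is $\mathcal{N}(0,\bar{\alpha}^2+\bar{\beta}^2)$ and the true score is $-x_t/(\bar{\alpha}^2+\bar{\beta}^2)$. Fitting a one-parameter "model" $s(x)=ax$ by least squares on the conditional targets recovers exactly this coefficient:

```python
import numpy as np

# Toy conditional score matching: x0 ~ N(0, 1), p(x_t|x_0) = N(abar*x0, bbar^2).
# The regression target is grad log p(x_t|x_0) = -(x_t - abar*x0)/bbar^2 = -eps/bbar,
# and the least-squares fit of s(x) = a*x should recover the marginal score
# coefficient a = -1 / (abar^2 + bbar^2).
rng = np.random.default_rng(0)
abar, bbar = 0.8, 0.5                      # arbitrary illustrative schedule values
x0 = rng.standard_normal(200000)
eps = rng.standard_normal(200000)
xt = abar * x0 + bbar * eps
target = -eps / bbar                       # grad log p(x_t | x_0)
a_hat = (xt * target).mean() / (xt ** 2).mean()   # minimizer of E[(a*xt - target)^2]
a_true = -1.0 / (abar ** 2 + bbar ** 2)
```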
Reverse Derivation of Results
Up to this point, we have constructed a general workflow for generative diffusion models:
1. Define "demolition" (forward process) via stochastic differential equation \eqref{eq:sde-forward};
2. Solve for the expression of $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$;
3. Train $\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)$ via loss function \eqref{eq:score-match} (score matching);
4. Replace $\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})$ in equation \eqref{eq:reverse-sde} with $\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)$ to complete "construction" (reverse process).
Seeing terms like SDE and differential equations may naturally make people feel "panicked," but in essence the SDE is just a "facade." In fact, once the understanding of the SDE is converted into equations \eqref{eq:sde-discrete} and \eqref{eq:sde-proba}, the concept of an SDE can be discarded. Conceptually, therefore, there is not too much difficulty.
It is easy to define a stochastic differential equation \eqref{eq:sde-forward}, but not easy to solve it for $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$. The remainder of the original paper mainly consists of derivations and experiments for two practical examples. Since solving for $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$ is the hard part, in my view, rather than first defining \eqref{eq:sde-forward} and then solving for $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$, would it not be better to do as in DDIM: define $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$ first, and then reverse-engineer the corresponding SDE?
For example, we first define:
\begin{equation} p(\boldsymbol{x}_t|\boldsymbol{x}_0) = \mathcal{N}(\boldsymbol{x}_t; \bar{\alpha}_t \boldsymbol{x}_0,\bar{\beta}_t^2 \boldsymbol{I})\end{equation}
And without loss of generality, assume the start is $t=0$ and the end is $t=1$, then the boundaries that $\bar{\alpha}_t, \bar{\beta}_t$ must satisfy are:
\begin{equation} \bar{\alpha}_0 = 1,\quad \bar{\alpha}_1 = 0,\quad \bar{\beta}_0 = 0,\quad \bar{\beta}_1 = 1\end{equation}
Of course, theoretically, these boundary conditions only need to be sufficiently approximate and don't necessarily have to be exactly equal. For example, in the previous article, we analyzed that DDPM is equivalent to choosing $\bar{\alpha}_t = e^{-5t^2}$, which gives $e^{-5}\approx 0$ when $t=1$.
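That example can be checked in a couple of lines. Taking $\bar{\alpha}_t = e^{-5t^2}$ and pairing it with the VP-style $\bar{\beta}_t = \sqrt{1-\bar{\alpha}_t^2}$ (an illustrative choice, anticipating the VP-SDE below), the boundary conditions hold exactly at $t=0$ and approximately at $t=1$:

```python
import numpy as np

# Schedule from the text: abar_t = exp(-5 t^2), paired with
# bbar_t = sqrt(1 - abar_t^2) so that abar_t^2 + bbar_t^2 = 1.
abar = lambda t: np.exp(-5 * t ** 2)
bbar = lambda t: np.sqrt(1 - abar(t) ** 2)

print(abar(0.0), bbar(0.0))   # exactly 1.0 and 0.0
print(abar(1.0), bbar(1.0))   # e^{-5} ~ 0.0067 and ~1: boundaries hold approximately
```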
With $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$, reversing \eqref{eq:sde-forward} essentially requires solving for $p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_t)$, which must satisfy:
\begin{equation} p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_0) = \int p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_t) p(\boldsymbol{x}_t|\boldsymbol{x}_0) d\boldsymbol{x}_t\end{equation}
We consider a linear solution, namely:
\begin{equation}d\boldsymbol{x} = f_t\boldsymbol{x} dt + g_t d\boldsymbol{w}\end{equation}
As in "Conversations on Generative Diffusion Models (4): DDIM = High-viewpoint DDPM", we write:
| Notation | Meaning | Sampling |
| --- | --- | --- |
| $p(\boldsymbol{x}_{t+\Delta t}\vert\boldsymbol{x}_0)$ | $\mathcal{N}(\boldsymbol{x}_{t+\Delta t};\bar{\alpha}_{t+\Delta t} \boldsymbol{x}_0,\bar{\beta}_{t+\Delta t}^2 \boldsymbol{I})$ | $\boldsymbol{x}_{t+\Delta t} = \bar{\alpha}_{t+\Delta t} \boldsymbol{x}_0 + \bar{\beta}_{t+\Delta t} \boldsymbol{\varepsilon}$ |
| $p(\boldsymbol{x}_t\vert\boldsymbol{x}_0)$ | $\mathcal{N}(\boldsymbol{x}_t;\bar{\alpha}_t \boldsymbol{x}_0,\bar{\beta}_t^2 \boldsymbol{I})$ | $\boldsymbol{x}_t = \bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon}_1$ |
| $p(\boldsymbol{x}_{t+\Delta t}\vert\boldsymbol{x}_t)$ | $\mathcal{N}(\boldsymbol{x}_{t+\Delta t}; (1 + f_t\Delta t) \boldsymbol{x}_t, g_t^2 \Delta t\, \boldsymbol{I})$ | $\boldsymbol{x}_{t+\Delta t} = (1 + f_t\Delta t) \boldsymbol{x}_t + g_t\sqrt{\Delta t}\boldsymbol{\varepsilon}_2$ |
| $\int p(\boldsymbol{x}_{t+\Delta t}\vert\boldsymbol{x}_t) p(\boldsymbol{x}_t\vert\boldsymbol{x}_0) d\boldsymbol{x}_t$ | | $\begin{aligned}\boldsymbol{x}_{t+\Delta t} =&\, (1 + f_t\Delta t) \boldsymbol{x}_t + g_t\sqrt{\Delta t} \boldsymbol{\varepsilon}_2 \\ =&\, (1 + f_t\Delta t) (\bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon}_1) + g_t\sqrt{\Delta t} \boldsymbol{\varepsilon}_2 \\ =&\, (1 + f_t\Delta t) \bar{\alpha}_t \boldsymbol{x}_0 + \left((1 + f_t\Delta t)\bar{\beta}_t \boldsymbol{\varepsilon}_1 + g_t\sqrt{\Delta t} \boldsymbol{\varepsilon}_2\right) \end{aligned}$ |
From this we obtain:
\begin{equation}\begin{aligned}
\bar{\alpha}_{t+\Delta t} =&\, (1 + f_t\Delta t) \bar{\alpha}_t \\
\bar{\beta}_{t+\Delta t}^2 =&\, (1 + f_t\Delta t)^2\bar{\beta}_t^2 + g_t^2\Delta t
\end{aligned}\end{equation}
Letting $\Delta t\to 0$ turns these recursions into differential equations, which we solve respectively:
\begin{equation}
f_t = \frac{d}{dt} \left(\ln \bar{\alpha}_t\right) = \frac{1}{\bar{\alpha}_t}\frac{d\bar{\alpha}_t}{dt}, \quad g_t^2 = \bar{\alpha}_t^2 \frac{d}{dt}\left(\frac{\bar{\beta}_t^2}{\bar{\alpha}_t^2}\right) = 2\bar{\alpha}_t \bar{\beta}_t \frac{d}{dt}\left(\frac{\bar{\beta}_t}{\bar{\alpha}_t}\right)\end{equation}
When $\bar{\alpha}_t\equiv 1$, the result is the VE-SDE (Variance Exploding SDE) in the paper; whereas if $\bar{\alpha}_t^2 + \bar{\beta}_t^2=1$, the result is the VP-SDE (Variance Preserving SDE) in the original paper.
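As a concrete check of these formulas, take the VP-type schedule from the earlier example, $\bar{\alpha}_t = e^{-5t^2}$ with $\bar{\beta}_t^2 = 1-\bar{\alpha}_t^2$. The formulas then give $f_t = \frac{d}{dt}\ln\bar{\alpha}_t = -10t$ and, after simplification, $g_t^2 = 20t$; the sketch below confirms both against finite differences:

```python
import numpy as np

# VP schedule: abar_t = exp(-5 t^2), bbar_t^2 = 1 - abar_t^2.
abar = lambda t: np.exp(-5 * t ** 2)
bbar2 = lambda t: 1 - abar(t) ** 2
ratio = lambda t: bbar2(t) / abar(t) ** 2   # the quantity differentiated for g_t^2

t, h = 0.37, 1e-6                            # arbitrary test point
f_fd = (np.log(abar(t + h)) - np.log(abar(t - h))) / (2 * h)
g2_fd = abar(t) ** 2 * (ratio(t + h) - ratio(t - h)) / (2 * h)
print(f_fd, -10 * t)                         # f_t = d ln(abar_t)/dt = -10 t
print(g2_fd, 20 * t)                         # g_t^2 = abar^2 d/dt(bbar^2/abar^2) = 20 t
```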
As for the loss function, at this point we can calculate:
\begin{equation}\nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0) = -\frac{\boldsymbol{x}_t - \bar{\alpha}_t\boldsymbol{x}_0}{\bar{\beta}_t^2}=-\frac{\boldsymbol{\varepsilon}}{\bar{\beta}_t}\end{equation}
The second equality holds because $\boldsymbol{x}_t = \bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon}$. To align with previous results, we set $\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) = -\frac{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)}{\bar{\beta}_t}$, and equation \eqref{eq:score-match} becomes:
\begin{equation}\frac{1}{\bar{\beta}_t^2}\mathbb{E}_{\boldsymbol{x}_0\sim \tilde{p}(\boldsymbol{x}_0),\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})}\left[\left\Vert \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon}, t) - \boldsymbol{\varepsilon}\right\Vert^2\right]\end{equation}
After ignoring the coefficient, this is the loss function of DDPM. Replacing $\nabla_{\boldsymbol{x}_{t+\Delta t}}\log p(\boldsymbol{x}_{t+\Delta t})$ in equation \eqref{eq:reverse-sde-discrete} with $-\frac{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t+\Delta t}, t+\Delta t)}{\bar{\beta}_{t+\Delta t}}$, the result has the same first-order approximation as the sampling process of DDPM (meaning that they are equivalent as $\Delta t \to 0$).
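The substitution $\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) = -\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)/\bar{\beta}_t$ can also be sanity-checked directly: since $\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t|\boldsymbol{x}_0) = -\boldsymbol{\varepsilon}/\bar{\beta}_t$, the score-matching residual equals the $\boldsymbol{\varepsilon}$-prediction residual scaled by $1/\bar{\beta}_t^2$. A sketch with arbitrary toy numbers (the "network output" is just random noise standing in for $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}$):

```python
import numpy as np

# Check: || s_theta - grad log p(x_t|x_0) ||^2 == || eps_theta - eps ||^2 / bbar_t^2.
rng = np.random.default_rng(0)
abar_t, bbar_t = 0.8, 0.6                    # arbitrary illustrative schedule values
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
xt = abar_t * x0 + bbar_t * eps
eps_theta = rng.standard_normal(5)           # stand-in for a network's output
s_theta = -eps_theta / bbar_t
target = -(xt - abar_t * x0) / bbar_t ** 2   # grad log p(x_t | x_0) = -eps / bbar_t
lhs = np.sum((s_theta - target) ** 2)
rhs = np.sum((eps_theta - eps) ** 2) / bbar_t ** 2
```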
Summary
This article mainly introduced the general framework for understanding diffusion models using SDE established by Dr. Yang Song. This included deriving results such as the reverse SDE and score matching in language as intuitive as possible, and providing my own thoughts on solving the equations.