Talk on Generative Diffusion Models (Part 3): DDPM = Bayes + Denoising

By 苏剑林 | July 19, 2022

So far, I have provided two derivations for the generative diffusion model DDPM: the common analogy scheme in "Talk on Generative Diffusion Models (Part 1): DDPM = Demolishing + Building" and the Variational Autoencoder (VAE) scheme in "Talk on Generative Diffusion Models (Part 2): DDPM = Autoregressive VAE". These two schemes have their own characteristics; the former is more straightforward and easy to understand but cannot provide much theoretical extension or quantitative understanding, while the latter is more theoretically complete but slightly formal and lacks heuristic insight.

In this article, we will share another derivation of DDPM, which primarily utilizes Bayes' theorem to simplify calculations. The entire process has a strong sense of "deliberation" and is quite illuminating. Moreover, it is closely related to the DDIM model that we will introduce later.

Model Landscape

To recap, DDPM models the following transformation process:

\begin{equation}\boldsymbol{x} = \boldsymbol{x}_0 \rightleftharpoons \boldsymbol{x}_1 \rightleftharpoons \boldsymbol{x}_2 \rightleftharpoons \cdots \rightleftharpoons \boldsymbol{x}_{T-1} \rightleftharpoons \boldsymbol{x}_T = \boldsymbol{z}\end{equation}

The forward process gradually turns the sample data $\boldsymbol{x}$ into random noise $\boldsymbol{z}$, and the reverse process gradually turns random noise $\boldsymbol{z}$ back into sample data $\boldsymbol{x}$. The reverse process is the "generative model" we hope to obtain.

The forward process is simple; each step is:

\begin{equation}\boldsymbol{x}_t = \alpha_t \boldsymbol{x}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t,\quad \boldsymbol{\varepsilon}_t\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\end{equation}

Alternatively, it can be written as $p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})=\mathcal{N}(\boldsymbol{x}_t;\alpha_t \boldsymbol{x}_{t-1},\beta_t^2 \boldsymbol{I})$. Under the constraint $\alpha_t^2 + \beta_t^2 = 1$, we have:

\begin{equation}\begin{aligned} \boldsymbol{x}_t =&\, \alpha_t \boldsymbol{x}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t \\ =&\, \alpha_t \big(\alpha_{t-1} \boldsymbol{x}_{t-2} + \beta_{t-1} \boldsymbol{\varepsilon}_{t-1}\big) + \beta_t \boldsymbol{\varepsilon}_t \\ =&\,\cdots\\ =&\,(\alpha_t\cdots\alpha_1) \boldsymbol{x}_0 + \underbrace{(\alpha_t\cdots\alpha_2)\beta_1 \boldsymbol{\varepsilon}_1 + (\alpha_t\cdots\alpha_3)\beta_2 \boldsymbol{\varepsilon}_2 + \cdots + \alpha_t\beta_{t-1} \boldsymbol{\varepsilon}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t}_{\sim \mathcal{N}(\boldsymbol{0}, (1-\alpha_t^2\cdots\alpha_1^2)\boldsymbol{I})} \end{aligned}\end{equation}

From this, we can derive $p(\boldsymbol{x}_t|\boldsymbol{x}_0)=\mathcal{N}(\boldsymbol{x}_t;\bar{\alpha}_t \boldsymbol{x}_0,\bar{\beta}_t^2 \boldsymbol{I})$, where $\bar{\alpha}_t = \alpha_1\cdots\alpha_t$ and $\bar{\beta}_t = \sqrt{1-\bar{\alpha}_t^2}$. What DDPM needs to achieve is to solve for the $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$ required by the reverse process from this information. This way, we can start from any $\boldsymbol{x}_T=\boldsymbol{z}$ and step-by-step sample $\boldsymbol{x}_{T-1},\boldsymbol{x}_{T-2},\cdots,\boldsymbol{x}_1$, eventually obtaining the randomly generated sample data $\boldsymbol{x}_0=\boldsymbol{x}$.

Enter Bayes

Now we invite the great Bayes' theorem. In fact, directly according to Bayes' theorem, we have:

\begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t) = \frac{p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})p(\boldsymbol{x}_{t-1})}{p(\boldsymbol{x}_t)}\label{eq:bayes}\end{equation}

However, we do not know the expressions for $p(\boldsymbol{x}_{t-1})$ and $p(\boldsymbol{x}_t)$, so this path is blocked. But we can settle for the next best thing and use Bayes' theorem given the condition $\boldsymbol{x}_0$:

\begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0) = \frac{p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_0)}{p(\boldsymbol{x}_t|\boldsymbol{x}_0)}\end{equation}

This modification is natural because $p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})$, $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_0)$, and $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$ are all known, making the above equation calculable. Substituting their respective expressions gives:

\begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0) = \mathcal{N}\left(\boldsymbol{x}_{t-1};\frac{\alpha_t\bar{\beta}_{t-1}^2}{\bar{\beta}_t^2}\boldsymbol{x}_t + \frac{\bar{\alpha}_{t-1}\beta_t^2}{\bar{\beta}_t^2}\boldsymbol{x}_0,\frac{\bar{\beta}_{t-1}^2\beta_t^2}{\bar{\beta}_t^2} \boldsymbol{I}\right)\label{eq:p-xt-x0}\end{equation}

Derivation: The derivation of the above formula is not difficult; it is simply a routine expansion and rearrangement, though we can find some tricks to speed up the calculation. First, by substituting the respective expressions, it can be seen that after removing the $-1/2$ factor from the exponent, the result is:

\begin{equation}\frac{\Vert \boldsymbol{x}_t - \alpha_t \boldsymbol{x}_{t-1}\Vert^2}{\beta_t^2} + \frac{\Vert \boldsymbol{x}_{t-1} - \bar{\alpha}_{t-1}\boldsymbol{x}_0\Vert^2}{\bar{\beta}_{t-1}^2} - \frac{\Vert \boldsymbol{x}_t - \bar{\alpha}_t \boldsymbol{x}_0\Vert^2}{\bar{\beta}_t^2}\end{equation}

This is quadratic with respect to $\boldsymbol{x}_{t-1}$, so the final distribution must also be a normal distribution. We only need to find its mean and covariance. It is not difficult to see that the coefficient of the $\Vert \boldsymbol{x}_{t-1}\Vert^2$ term in the expansion is:

\begin{equation}\frac{\alpha_t^2}{\beta_t^2} + \frac{1}{\bar{\beta}_{t-1}^2} = \frac{\alpha_t^2\bar{\beta}_{t-1}^2 + \beta_t^2}{\bar{\beta}_{t-1}^2 \beta_t^2} = \frac{\alpha_t^2(1-\bar{\alpha}_{t-1}^2) + \beta_t^2}{\bar{\beta}_{t-1}^2 \beta_t^2} = \frac{1-\bar{\alpha}_t^2}{\bar{\beta}_{t-1}^2 \beta_t^2} = \frac{\bar{\beta}_t^2}{\bar{\beta}_{t-1}^2 \beta_t^2}\end{equation}

Therefore, the rearranged result must be in the form of $\frac{\bar{\beta}_t^2}{\bar{\beta}_{t-1}^2 \beta_t^2}\Vert \boldsymbol{x}_{t-1} - \tilde{\boldsymbol{\mu}}(\boldsymbol{x}_t, \boldsymbol{x}_0)\Vert^2$, which means the covariance matrix is $\frac{\bar{\beta}_{t-1}^2 \beta_t^2}{\bar{\beta}_t^2}\boldsymbol{I}$. On the other hand, isolating the coefficient of the linear term gives $-2\left(\frac{\alpha_t}{\beta_t^2}\boldsymbol{x}_t + \frac{\bar{\alpha}_{t-1}}{\bar{\beta}_{t-1}^2}\boldsymbol{x}_0 \right)$. Dividing this by the factor $\frac{-2\bar{\beta}_t^2}{\bar{\beta}_{t-1}^2 \beta_t^2}$ yields:

\begin{equation}\tilde{\boldsymbol{\mu}}(\boldsymbol{x}_t, \boldsymbol{x}_0)=\frac{\alpha_t\bar{\beta}_{t-1}^2}{\bar{\beta}_t^2}\boldsymbol{x}_t + \frac{\bar{\alpha}_{t-1}\beta_t^2}{\bar{\beta}_t^2}\boldsymbol{x}_0 \end{equation}

This gives us all the information for $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$, and the result is exactly equation $\eqref{eq:p-xt-x0}$.

Denoising Process

Now that we have $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$, it has an explicit solution, but it is not our desired final answer because we want to predict $\boldsymbol{x}_{t-1}$ using only $\boldsymbol{x}_t$, without depending on $\boldsymbol{x}_0$—$\boldsymbol{x}_0$ is the result we eventually want to generate. Next, a "daring" idea is: if we could predict $\boldsymbol{x}_0$ through $\boldsymbol{x}_t$, couldn't we eliminate $\boldsymbol{x}_0$ from $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$ so that it only depends on $\boldsymbol{x}_t$?

Let's do it. We use $\bar{\boldsymbol{\mu}}(\boldsymbol{x}_t)$ to estimate $\boldsymbol{x}_0$, with the loss function being $\Vert \boldsymbol{x}_0 - \bar{\boldsymbol{\mu}}(\boldsymbol{x}_t)\Vert^2$. Once the training is complete, we assume:

\begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t) \approx p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0=\bar{\boldsymbol{\mu}}(\boldsymbol{x}_t)) = \mathcal{N}\left(\boldsymbol{x}_{t-1}; \frac{\alpha_t\bar{\beta}_{t-1}^2}{\bar{\beta}_t^2}\boldsymbol{x}_t + \frac{\bar{\alpha}_{t-1}\beta_t^2}{\bar{\beta}_t^2}\bar{\boldsymbol{\mu}}(\boldsymbol{x}_t),\frac{\bar{\beta}_{t-1}^2\beta_t^2}{\bar{\beta}_t^2} \boldsymbol{I}\right)\label{eq:p-xt}\end{equation}

In $\Vert \boldsymbol{x}_0 - \bar{\boldsymbol{\mu}}(\boldsymbol{x}_t)\Vert^2$, $\boldsymbol{x}_0$ represents the original data and $\boldsymbol{x}_t$ represents the noisy data, so this is essentially training a denoising model, which is the meaning of the first "D" in DDPM (Denoising). Specifically, $p(\boldsymbol{x}_t|\boldsymbol{x}_0)=\mathcal{N}(\boldsymbol{x}_t;\bar{\alpha}_t \boldsymbol{x}_0,\bar{\beta}_t^2 \boldsymbol{I})$ means $\boldsymbol{x}_t = \bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon}$ where $\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$, or written as $\boldsymbol{x}_0 = \frac{1}{\bar{\alpha}_t}\left(\boldsymbol{x}_t - \bar{\beta}_t \boldsymbol{\varepsilon}\right)$. This inspires us to parameterize $\bar{\boldsymbol{\mu}}(\boldsymbol{x}_t)$ as:

\begin{equation}\bar{\boldsymbol{\mu}}(\boldsymbol{x}_t) = \frac{1}{\bar{\alpha}_t}\left(\boldsymbol{x}_t - \bar{\beta}_t \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right)\label{eq:bar-mu}\end{equation}

At this point, the loss function becomes:

\begin{equation}\Vert \boldsymbol{x}_0 - \bar{\boldsymbol{\mu}}(\boldsymbol{x}_t)\Vert^2 = \frac{\bar{\beta}_t^2}{\bar{\alpha}_t^2}\left\Vert\boldsymbol{\varepsilon} - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon}, t)\right\Vert^2\end{equation}

Omitting the front coefficient, we get the loss function used in the original DDPM paper. It can be seen that this article directly derives the denoising process from $\boldsymbol{x}_t$ to $\boldsymbol{x}_0$, rather than deriving through the denoising process from $\boldsymbol{x}_t$ to $\boldsymbol{x}_{t-1}$ followed by integral transformation as in the previous two articles. In comparison, the derivation here is more immediate.

On the other hand, substituting equation $\eqref{eq:bar-mu}$ into equation $\eqref{eq:p-xt}$, we simplify it to get:

\begin{equation} p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t) \approx p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0=\bar{\boldsymbol{\mu}}(\boldsymbol{x}_t)) = \mathcal{N}\left(\boldsymbol{x}_{t-1}; \frac{1}{\alpha_t}\left(\boldsymbol{x}_t - \frac{\beta_t^2}{\bar{\beta}_t}\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right),\frac{\bar{\beta}_{t-1}^2\beta_t^2}{\bar{\beta}_t^2} \boldsymbol{I}\right)\end{equation}

This is the distribution used for the reverse sampling process, whereby the variance used in the sampling process is also determined. Thus, DDPM is derived!

(Note: For the sake of derivation flow, $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}$ in this article is defined differently from the previous two but is consistent with the original DDPM paper.)

Derivation: The main difficulty in substituting equation $\eqref{eq:bar-mu}$ into equation $\eqref{eq:p-xt}$ is calculating:

\begin{equation}\begin{aligned}\frac{\alpha_t\bar{\beta}_{t-1}^2}{\bar{\beta}_t^2} + \frac{\bar{\alpha}_{t-1}\beta_t^2}{\bar{\alpha}_t\bar{\beta}_t^2} =&\, \frac{\alpha_t\bar{\beta}_{t-1}^2 + \beta_t^2/\alpha_t}{\bar{\beta}_t^2} = \frac{\alpha_t^2(1-\bar{\alpha}_{t-1}^2) + \beta_t^2}{\alpha_t\bar{\beta}_t^2} = \frac{1-\bar{\alpha}_t^2}{\alpha_t\bar{\beta}_t^2} = \frac{1}{\alpha_t} \end{aligned}\end{equation}

Predict-Correct

I wonder if the reader has noticed an interesting point: what we want to do is slowly turn $\boldsymbol{x}_T$ into $\boldsymbol{x}_0$. However, when we use $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$ to approximate $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$, we include the step "using $\bar{\boldsymbol{\mu}}(\boldsymbol{x}_t)$ to estimate $\boldsymbol{x}_0$". If we could predict it accurately, wouldn't we just reach the end in one step? Why would we still need step-by-step sampling?

The reality is that "using $\bar{\boldsymbol{\mu}}(\boldsymbol{x}_t)$ to estimate $\boldsymbol{x}_0$" will certainly not be very accurate, at least not for quite a many steps at the beginning. It only serves as a forward-looking prediction, and then we use $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$ to move forward just one small step. This is the "Predictor-Corrector" idea found in many numerical algorithms: we use a rough solution to project many steps forward, then use that rough result to move the final result forward by one small step, thereby gradually obtaining a more refined solution.

From this, we can also think of the Lookahead Optimizer: k steps forward, 1 step back proposed by Hinton three years ago. It similarly contains a prediction (k steps forward) and a correction (1 step back). The original paper interprets it as a combination of "Fast" and "Slow" weights, where the fast weight is the predicted result and the slow weight is the corrected result based on the prediction. If we wish, we can interpret the "Predict-Correct" process of DDPM in the same way.

Remaining Issues

Finally, in the section using Bayes' theorem, we said equation $\eqref{eq:bayes}$ cannot be used directly because $p(\boldsymbol{x}_{t-1})$ and $p(\boldsymbol{x}_t)$ are both unknown. By definition, we have:

\begin{equation}p(\boldsymbol{x}_t) = \int p(\boldsymbol{x}_t|\boldsymbol{x}_0)\tilde{p}(\boldsymbol{x}_0)d\boldsymbol{x}_0\end{equation}

While $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$ is known, the data distribution $\tilde{p}(\boldsymbol{x}_0)$ cannot be known in advance, so the calculation cannot be performed. However, there are two special cases where both can be directly calculated, which we will provide here. These results happen to be the answer to the variance selection problem left over from the previous article.

The first example is when the entire dataset has only one sample. Without loss of generality, assume the sample is $\boldsymbol{0}$. At this time, $\tilde{p}(\boldsymbol{x}_0)$ is the Dirac distribution $\delta(\boldsymbol{x}_0)$, and we can directly calculate $p(\boldsymbol{x}_t)=p(\boldsymbol{x}_t|\boldsymbol{0})$. Substituting this into equation $\eqref{eq:bayes}$, we find the result is exactly the special case of $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t,\boldsymbol{x}_0)$ taking $\boldsymbol{x}_0=\boldsymbol{0}$:

\begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t) = p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0=\boldsymbol{0}) = \mathcal{N}\left(\boldsymbol{x}_{t-1};\frac{\alpha_t\bar{\beta}_{t-1}^2}{\bar{\beta}_t^2}\boldsymbol{x}_t,\frac{\bar{\beta}_{t-1}^2\beta_t^2}{\bar{\beta}_t^2} \boldsymbol{I}\right)\end{equation}

We are primarily interested in its variance $\frac{\bar{\beta}_{t-1}^2\beta_t^2}{\bar{\beta}_t^2}$, which is one of the choices for sampling variance.

The second example is when the dataset follows a standard normal distribution, i.e., $\tilde{p}(\boldsymbol{x}_0)=\mathcal{N}(\boldsymbol{x}_0;\boldsymbol{0},\boldsymbol{I})$. We previously noted that $p(\boldsymbol{x}_t|\boldsymbol{x}_0)=\mathcal{N}(\boldsymbol{x}_t;\bar{\alpha}_t \boldsymbol{x}_0,\bar{\beta}_t^2 \boldsymbol{I})$ implies $\boldsymbol{x}_t = \bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon}, \boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$. Given the assumption $\boldsymbol{x}_0\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$, by the property of additivity of normal distributions, $\boldsymbol{x}_t$ also follows a standard normal distribution. After substituting the probability density of the standard normal distribution into equation $\eqref{eq:bayes}$, the exponent (after removing the $-1/2$ factor) is:

\begin{equation}\frac{\Vert \boldsymbol{x}_t - \alpha_t \boldsymbol{x}_{t-1}\Vert^2}{\beta_t^2} + \Vert \boldsymbol{x}_{t-1}\Vert^2 - \Vert \boldsymbol{x}_t\Vert^2\end{equation}

Similar to the process of deriving $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t,\boldsymbol{x}_0)$, we can obtain the distribution corresponding to the above exponent as:

\begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t) = \mathcal{N}\left(\boldsymbol{x}_{t-1};\alpha_t\boldsymbol{x}_t,\beta_t^2 \boldsymbol{I}\right)\end{equation}

We are equally interested in its variance $\beta_t^2$, which is another choice for sampling variance.

Article Summary

This article shared a derivation of DDPM with a strong sense of analytical "refinement." It uses Bayes' theorem to directly derive the reverse generative process, which is more immediate than the previous "Demolishing-Building" analogy and variational inference understanding. Simultaneously, it is more heuristic and closely linked to the DDIM model to be introduced next.