Diffusion Models Part 4: DDIM = DDPM from a High-Level Perspective

By 苏剑林 | July 27, 2022

I believe many readers have heard of, or even read, Felix Klein's series of books Elementary Mathematics from an Advanced Standpoint. As the name suggests, this involves re-examining elementary mathematics learned in the past from a higher perspective after studying more in-depth and complete mathematical knowledge, in order to gain a more comprehensive understanding and even achieve the effect of "reviewing the old to know the new." There are many similar books, such as Calculus Revisited and Visual Complex Analysis. Returning to diffusion models, we have already interpreted DDPM from different perspectives through three articles. Does there also exist a "higher" perspective for understanding it that allows us to gain new insights? Certainly there is; the DDIM model introduced in Denoising Diffusion Implicit Models is a classic case. Let us appreciate it in this article.

Analyzing the Idea

In Diffusion Models Part 3: DDPM = Bayesian + Denoising, we mentioned that the derivation introduced in that article is closely related to DDIM. Specifically, the derivation route of that article can be briefly summarized as follows:

\begin{equation}p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})\xrightarrow{\text{derivation}}p(\boldsymbol{x}_t|\boldsymbol{x}_0)\xrightarrow{\text{derivation}}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)\xrightarrow{\text{approximation}}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)\end{equation}

This process is a step-by-step progression. However, we find that the final result has two characteristics: 1. The loss function only depends on $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$; 2. The sampling process only depends on $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$. That is to say, although the entire process starts from $p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})$ and moves forward step by step, from the perspective of the result, $p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})$ doesn't seem to play a role at all. So, let's take a bold "leap of imagination":

High Perspective 1: Since the result is unrelated to $p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})$, can we simply "burn the bridge" and remove $p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})$ from the entire derivation process?

DDIM is precisely the product of this "leap of imagination"!

Undetermined Coefficients

Some readers might think: according to Bayes' theorem used in the previous article

\begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0) = \frac{p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_0)}{p(\boldsymbol{x}_t|\boldsymbol{x}_0)}\end{equation}

How can we obtain $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$ without being given $p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})$? This is an overly rigid way of thinking. In fact, when $p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})$ is not given, the solution space for $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$ is larger, so in a sense the derivation is actually easier: $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$ only needs to satisfy the marginal distribution condition:

\begin{equation}\int p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0) p(\boldsymbol{x}_t|\boldsymbol{x}_0) d\boldsymbol{x}_t = p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_0)\label{eq:margin}\end{equation}

We use the method of undetermined coefficients to solve this equation. In the previous article, the solved $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$ was a normal distribution, so this time we can more generally set

\begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0) = \mathcal{N}(\boldsymbol{x}_{t-1}; \kappa_t \boldsymbol{x}_t + \lambda_t \boldsymbol{x}_0, \sigma_t^2 \boldsymbol{I})\end{equation}

where $\kappa_t, \lambda_t, \sigma_t$ are all undetermined coefficients. To avoid re-training the model, we do not change $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_0)$ and $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$. Thus, we can list the following:

\begin{equation}\begin{array}{l|l|l}
\text{Notation} & \text{Meaning} & \text{Sampling}\\
\hline
p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_0) & \mathcal{N}(\boldsymbol{x}_{t-1};\bar{\alpha}_{t-1} \boldsymbol{x}_0,\bar{\beta}_{t-1}^2 \boldsymbol{I}) & \boldsymbol{x}_{t-1} = \bar{\alpha}_{t-1} \boldsymbol{x}_0 + \bar{\beta}_{t-1} \boldsymbol{\varepsilon}\\
p(\boldsymbol{x}_t|\boldsymbol{x}_0) & \mathcal{N}(\boldsymbol{x}_t;\bar{\alpha}_t \boldsymbol{x}_0,\bar{\beta}_t^2 \boldsymbol{I}) & \boldsymbol{x}_t = \bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon}_1\\
p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0) & \mathcal{N}(\boldsymbol{x}_{t-1}; \kappa_t \boldsymbol{x}_t + \lambda_t \boldsymbol{x}_0, \sigma_t^2 \boldsymbol{I}) & \boldsymbol{x}_{t-1} = \kappa_t \boldsymbol{x}_t + \lambda_t \boldsymbol{x}_0 + \sigma_t \boldsymbol{\varepsilon}_2\\
\int p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)\, p(\boldsymbol{x}_t|\boldsymbol{x}_0)\, d\boldsymbol{x}_t & & \begin{aligned}\boldsymbol{x}_{t-1} =&\, \kappa_t \boldsymbol{x}_t + \lambda_t \boldsymbol{x}_0 + \sigma_t \boldsymbol{\varepsilon}_2 \\ =&\, \kappa_t (\bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon}_1) + \lambda_t \boldsymbol{x}_0 + \sigma_t \boldsymbol{\varepsilon}_2 \\ =&\, (\kappa_t \bar{\alpha}_t + \lambda_t) \boldsymbol{x}_0 + (\kappa_t\bar{\beta}_t \boldsymbol{\varepsilon}_1 + \sigma_t \boldsymbol{\varepsilon}_2)\end{aligned}
\end{array}\end{equation}

where $\boldsymbol{\varepsilon},\boldsymbol{\varepsilon}_1,\boldsymbol{\varepsilon}_2\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})$. From the additivity of normal distributions, we know that $\kappa_t\bar{\beta}_t \boldsymbol{\varepsilon}_1 + \sigma_t \boldsymbol{\varepsilon}_2\sim \sqrt{\kappa_t^2\bar{\beta}_t^2 + \sigma_t^2} \boldsymbol{\varepsilon}$. Comparing the two sampling forms of $\boldsymbol{x}_{t-1}$, we find that for equation $\eqref{eq:margin}$ to hold, only two equations need to be satisfied:

\begin{equation}\bar{\alpha}_{t-1} = \kappa_t \bar{\alpha}_t + \lambda_t, \qquad\bar{\beta}_{t-1} = \sqrt{\kappa_t^2\bar{\beta}_t^2 + \sigma_t^2}\end{equation}

It can be seen that there are three unknowns but only two equations. This is why it was said that the solution space is even larger when $p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})$ is not given. Treating $\sigma_t$ as a variable parameter, we can solve for:

\begin{equation}\kappa_t = \frac{\sqrt{\bar{\beta}_{t-1}^2 - \sigma_t^2}}{\bar{\beta}_t},\qquad \lambda_t = \bar{\alpha}_{t-1} - \frac{\bar{\alpha}_t\sqrt{\bar{\beta}_{t-1}^2 - \sigma_t^2}}{\bar{\beta}_t}\end{equation}

Or written as:

\begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0) = \mathcal{N}\left(\boldsymbol{x}_{t-1}; \frac{\sqrt{\bar{\beta}_{t-1}^2 - \sigma_t^2}}{\bar{\beta}_t} \boldsymbol{x}_t + \left(\bar{\alpha}_{t-1} - \frac{\bar{\alpha}_t\sqrt{\bar{\beta}_{t-1}^2 - \sigma_t^2}}{\bar{\beta}_t}\right) \boldsymbol{x}_0, \sigma_t^2 \boldsymbol{I}\right)\label{eq:p-xt-x0}\end{equation}

For convenience, we define $\bar{\alpha}_0=1, \bar{\beta}_0=0$. In particular, this result does not require the restriction $\bar{\alpha}_t^2 + \bar{\beta}_t^2 = 1$, though to simplify parameter settings and align with previous results, we still adopt $\bar{\alpha}_t^2 + \bar{\beta}_t^2 = 1$ here.
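As a quick sanity check of this solution, the marginal condition $\eqref{eq:margin}$ can be verified numerically. Below is a minimal numpy sketch with arbitrarily chosen illustrative values of $\bar{\alpha}_{t-1}, \bar{\alpha}_t, \sigma_t$ (the specific numbers are assumptions for illustration only): it samples $\boldsymbol{x}_t$ from $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$, then $\boldsymbol{x}_{t-1}$ from the posited $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$, and checks that the result matches $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_0)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters satisfying alpha_bar^2 + beta_bar^2 = 1 (values are arbitrary)
alpha_bar_t, alpha_bar_tm1 = 0.6, 0.8            # \bar{\alpha}_t, \bar{\alpha}_{t-1}
beta_bar_t = np.sqrt(1 - alpha_bar_t ** 2)       # \bar{\beta}_t = 0.8
beta_bar_tm1 = np.sqrt(1 - alpha_bar_tm1 ** 2)   # \bar{\beta}_{t-1} = 0.6
sigma_t = 0.3                                    # free parameter, must satisfy sigma_t <= \bar{\beta}_{t-1}

# Undetermined coefficients solved from the two constraints
kappa_t = np.sqrt(beta_bar_tm1 ** 2 - sigma_t ** 2) / beta_bar_t
lambda_t = alpha_bar_tm1 - kappa_t * alpha_bar_t

x0, n = 1.7, 1_000_000                           # any fixed x_0, number of Monte Carlo samples
x_t = alpha_bar_t * x0 + beta_bar_t * rng.standard_normal(n)              # x_t ~ p(x_t|x_0)
x_tm1 = kappa_t * x_t + lambda_t * x0 + sigma_t * rng.standard_normal(n)  # x_{t-1} ~ p(x_{t-1}|x_t, x_0)

# Should match N(\bar{\alpha}_{t-1} x_0, \bar{\beta}_{t-1}^2 I): mean 1.36, std 0.6
print(x_tm1.mean(), x_tm1.std())
```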

Business as Usual

Now, given only $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$ and $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_0)$, we have used the method of undetermined coefficients to solve for a family of solutions for $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$, which carries a free parameter $\sigma_t$. Using the "demolishing-constructing" analogy from Diffusion Models Part 1, it means we know what the building will look like after being demolished [$p(\boldsymbol{x}_t|\boldsymbol{x}_0), p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_0)$], but we don't know the exact steps of demolition [$p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})$], and then we hope to learn the steps of construction from this [$p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$]. Of course, if we want to see how each step is demolished, we can conversely use the Bayesian formula:

\begin{equation} p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1}, \boldsymbol{x}_0) = \frac{p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0) p(\boldsymbol{x}_t|\boldsymbol{x}_0)}{p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_0)}\end{equation}

The next steps are exactly the same as in the previous article: we ultimately want $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$ rather than $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$, so we hope to use

\begin{equation}\bar{\boldsymbol{\mu}}(\boldsymbol{x}_t) = \frac{1}{\bar{\alpha}_t}\left(\boldsymbol{x}_t - \bar{\beta}_t \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right)\end{equation}

to estimate $\boldsymbol{x}_0$. Since $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$ has not been changed, the target function used for training is still $\left\Vert\boldsymbol{\varepsilon} - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon}, t)\right\Vert^2$ (excluding weight coefficients). This means the training process remains unchanged, and we can reuse the model trained by DDPM. After replacing $\boldsymbol{x}_0$ in equation $\eqref{eq:p-xt-x0}$ with $\bar{\boldsymbol{\mu}}(\boldsymbol{x}_t)$, we get:

\begin{equation}\begin{aligned} p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t) \approx&\, p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0=\bar{\boldsymbol{\mu}}(\boldsymbol{x}_t)) \\ =&\, \mathcal{N}\left(\boldsymbol{x}_{t-1}; \frac{1}{\alpha_t}\left(\boldsymbol{x}_t - \left(\bar{\beta}_t - \alpha_t\sqrt{\bar{\beta}_{t-1}^2 - \sigma_t^2}\right) \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right), \sigma_t^2 \boldsymbol{I}\right) \end{aligned}\label{eq:p-xt-x0-2}\end{equation}

This yields the $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$ required for the generation process, where $\alpha_t=\frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}$. Its characteristic is that the training process does not change (meaning the final saved model is the same), but the generation process has an adjustable parameter $\sigma_t$. It is this parameter that brings fresh results to DDPM.
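Written as code, one reverse step of this family takes only a few lines. The sketch below is a minimal numpy rendering of formula $\eqref{eq:p-xt-x0-2}$, where `eps_model(x, t)` stands for the trained noise predictor $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}$ (a hypothetical placeholder, not any particular library's API) and `alpha_bar[t]` stores $\bar{\alpha}_t$ with $\bar{\alpha}_0 = 1$:

```python
import numpy as np

def reverse_step(x_t, t, eps_model, alpha_bar, sigma, rng=np.random.default_rng()):
    """One step x_t -> x_{t-1} of the generalized reverse process, with free parameter sigma_t.

    alpha_bar[t] stores bar-alpha_t (with alpha_bar[0] = 1); bar-beta_t = sqrt(1 - bar-alpha_t**2).
    """
    ab_t, ab_tm1 = alpha_bar[t], alpha_bar[t - 1]
    bb_t = np.sqrt(1 - ab_t ** 2)        # \bar{\beta}_t
    bb_tm1 = np.sqrt(1 - ab_tm1 ** 2)    # \bar{\beta}_{t-1}
    a_t = ab_t / ab_tm1                  # \alpha_t, a derived quantity

    eps = eps_model(x_t, t)              # \epsilon_\theta(x_t, t)
    mean = (x_t - (bb_t - a_t * np.sqrt(bb_tm1 ** 2 - sigma ** 2)) * eps) / a_t
    return mean + sigma * rng.standard_normal(x_t.shape)
```

Different choices of `sigma`, discussed next, recover DDPM, DDIM, and everything in between, all with the same trained `eps_model`.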

Some Examples

In principle, we do not have many constraints on $\sigma_t$, but different $\sigma_t$ values will result in different characteristics in the sampling process. Let's analyze a few examples.

The first simple example is taking $\sigma_t = \frac{\bar{\beta}_{t-1}\beta_t}{\bar{\beta}_t}$, where $\beta_t = \sqrt{1 - \alpha_t^2}$. Accordingly, we have:

\begin{equation}\small{p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t) \approx p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0=\bar{\boldsymbol{\mu}}(\boldsymbol{x}_t)) = \mathcal{N}\left(\boldsymbol{x}_{t-1}; \frac{1}{\alpha_t}\left(\boldsymbol{x}_t - \frac{\beta_t^2}{\bar{\beta}_t}\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right),\frac{\bar{\beta}_{t-1}^2\beta_t^2}{\bar{\beta}_t^2} \boldsymbol{I}\right)}\label{eq:choice-1}\end{equation}

This is the DDPM derived in the previous article. Notably, the DDIM paper also conducted comparative experiments with $\sigma_t = \eta\frac{\bar{\beta}_{t-1}\beta_t}{\bar{\beta}_t}$, where $\eta\in[0, 1]$.

The second example is taking $\sigma_t = \beta_t$, which is one of the two choices for $\sigma_t$ pointed out in the first two articles. Under this choice, formula $\eqref{eq:p-xt-x0-2}$ cannot be further simplified, but DDIM's experimental results show that this choice performs very well under DDPM's standard parameter settings.

The most special example is taking $\sigma_t = 0$. In this case, the transformation from $\boldsymbol{x}_t$ to $\boldsymbol{x}_{t-1}$ is a deterministic transformation:

\begin{equation}\boldsymbol{x}_{t-1} = \frac{1}{\alpha_t}\left(\boldsymbol{x}_t - \left(\bar{\beta}_t - \alpha_t \bar{\beta}_{t-1}\right) \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right)\label{eq:sigma=0}\end{equation}

This is also a case that the DDIM paper is particularly concerned with. Strictly speaking, DDIM in the original paper specifically refers to the case where $\sigma_t=0$. The "I" stands for "Implicit," meaning this is an implicit probabilistic model because, unlike other choices, the generated result $\boldsymbol{x}_0$ starting from a given $\boldsymbol{x}_T = \boldsymbol{z}$ is not stochastic. As we will see later, this brings some benefits both theoretically and practically.
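To make these choices concrete, here is a small helper (a sketch in the same numpy notation as above) that computes $\sigma_t$ from $\bar{\alpha}_{t-1}, \bar{\alpha}_t$ using the $\eta$ parameterization mentioned earlier: $\eta = 1$ recovers the DDPM choice $\eqref{eq:choice-1}$, $\eta = 0$ gives the deterministic DDIM update $\eqref{eq:sigma=0}$, and the second example $\sigma_t = \beta_t$ is listed separately.

```python
import numpy as np

def sigma_eta(alpha_bar_tm1, alpha_bar_t, eta=0.0):
    """sigma_t = eta * bar-beta_{t-1} * beta_t / bar-beta_t.
    eta = 1 recovers DDPM's choice; eta = 0 gives deterministic DDIM."""
    bb_t = np.sqrt(1 - alpha_bar_t ** 2)                      # \bar{\beta}_t
    bb_tm1 = np.sqrt(1 - alpha_bar_tm1 ** 2)                  # \bar{\beta}_{t-1}
    beta_t = np.sqrt(1 - (alpha_bar_t / alpha_bar_tm1) ** 2)  # \beta_t
    return eta * bb_tm1 * beta_t / bb_t

def sigma_beta(alpha_bar_tm1, alpha_bar_t):
    """The second example above: sigma_t = beta_t."""
    return np.sqrt(1 - (alpha_bar_t / alpha_bar_tm1) ** 2)
```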

Accelerating Generation

It is worth pointing out that in this article we did not start from $p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})$, so all previous results are actually given in terms of notations related to $\bar{\alpha}_t, \bar{\beta}_t$. In contrast, $\alpha_t, \beta_t$ are notations derived via $\alpha_t=\frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}$ and $\beta_t = \sqrt{1 - \alpha_t^2}$. From the loss function $\left\Vert\boldsymbol{\varepsilon} - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon}, t)\right\Vert^2$, it can be seen that once the various $\bar{\alpha}_t$ values are given, the training process is determined.

From this process, DDIM further noted the following fact:

High Perspective 2: DDPM's training results essentially contain the training results for any of its subsequence parameters.

Specifically, let $\boldsymbol{\tau} = [\tau_1,\tau_2,\dots,\tau_{\dim(\boldsymbol{\tau})}]$ be any subsequence of $[1,2,\cdots,T]$. If we train a DDPM with $\dim(\boldsymbol{\tau})$ diffusion steps using $\bar{\alpha}_{\tau_1},\bar{\alpha}_{\tau_2},\cdots,\bar{\alpha}_{\tau_{\dim(\boldsymbol{\tau})}}$ as parameters, its target function is actually a subset of the target function of the original $T$-step DDPM with parameters $\bar{\alpha}_1,\bar{\alpha}_2,\cdots,\bar{\alpha}_T$! Therefore, as long as the model has sufficient fitting capacity, the trained $T$-step model already contains the training results for any subsequence of its parameters.

Then, thinking in reverse, if we have a trained $T$-step DDPM model, we can also treat it as a $\dim(\boldsymbol{\tau})$-step model trained with parameters $\bar{\alpha}_{\tau_1},\bar{\alpha}_{\tau_2},\cdots,\bar{\alpha}_{\tau_{\dim(\boldsymbol{\tau})}}$. Since it is a $\dim(\boldsymbol{\tau})$-step model, the generation process only needs $\dim(\boldsymbol{\tau})$ steps. According to formula $\eqref{eq:p-xt-x0-2}$, we have:

\begin{equation}p(\boldsymbol{x}_{\tau_{i-1}}|\boldsymbol{x}_{\tau_i}) \approx \mathcal{N}\left(\boldsymbol{x}_{\tau_{i-1}}; \frac{\bar{\alpha}_{\tau_{i-1}}}{\bar{\alpha}_{\tau_i}}\left(\boldsymbol{x}_{\tau_i} - \left(\bar{\beta}_{\tau_i} - \frac{\bar{\alpha}_{\tau_i}}{\bar{\alpha}_{\tau_{i-1}}}\sqrt{\bar{\beta}_{\tau_{i-1}}^2 - \tilde{\sigma}_{\tau_i}^2}\right) \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_{\tau_i}, \tau_i)\right), \tilde{\sigma}_{\tau_i}^2 \boldsymbol{I}\right)\end{equation}

This is the generation process for accelerated sampling, which reduces the original $T$ diffusion steps to $\dim(\boldsymbol{\tau})$ steps. Note that we cannot simply replace $\alpha_t$ in formula $\eqref{eq:p-xt-x0-2}$ with $\alpha_{\tau_i}$ because, as noted above, $\alpha_t$ is merely a derived notation that equals $\frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}$; it should therefore be replaced by $\frac{\bar{\alpha}_{\tau_i}}{\bar{\alpha}_{\tau_{i-1}}}$. Similarly, $\tilde{\sigma}_{\tau_i}$ is not simply $\sigma_{\tau_i}$; instead, one first rewrites the definition of $\sigma_t$ entirely in terms of the $\bar{\alpha}, \bar{\beta}$ symbols and then replaces $t$ with $\tau_i$ and $t-1$ with $\tau_{i-1}$. For example, the $\tilde{\sigma}_{\tau_i}$ corresponding to equation $\eqref{eq:choice-1}$ is:

\begin{equation}\sigma_t = \frac{\bar{\beta}_{t-1}\beta_t}{\bar{\beta}_t}=\frac{\bar{\beta}_{t-1}}{\bar{\beta}_t}\sqrt{1 - \frac{\bar{\alpha}_t^2}{\bar{\alpha}_{t-1}^2}}\quad\to\quad\frac{\bar{\beta}_{\tau_{i-1}}}{\bar{\beta}_{\tau_i}}\sqrt{1 - \frac{\bar{\alpha}_{\tau_i}^2}{\bar{\alpha}_{\tau_{i-1}}^2}}=\tilde{\sigma}_{\tau_i}\end{equation}

Readers might ask again: why don't we just directly train a $\dim(\boldsymbol{\tau})$-step diffusion model instead of training $T > \dim(\boldsymbol{\tau})$ steps and then doing subsequence sampling? I believe there are two considerations: on the one hand, regarding $\dim(\boldsymbol{\tau})$-step generation, training a model with more steps might enhance generalization ability; on the other hand, accelerating via a subsequence $\boldsymbol{\tau}$ is just one acceleration method. Training a more complete $T$-step model allows us to try more diverse acceleration methods without significantly increasing training costs.
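Putting the accelerated generation process into code: the sketch below is a minimal numpy rendering of the subsequence sampler, again with `eps_model(x, t)` as a hypothetical placeholder for the trained $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}$. As stressed above, everything is expressed through the $\bar{\alpha}$ values indexed by the subsequence; the per-step $\alpha_t, \beta_t, \sigma_t$ of the original chain never appear directly.

```python
import numpy as np

def ddim_sample(eps_model, alpha_bar, tau, shape, eta=0.0, rng=np.random.default_rng()):
    """Accelerated sampling over an increasing subsequence tau of [1, ..., T].

    alpha_bar[t] stores bar-alpha_t with alpha_bar[0] = 1.
    eta = 0 gives deterministic DDIM; eta = 1 gives the DDPM-like noise level.
    """
    x = rng.standard_normal(shape)          # x_T ~ N(0, I)
    taus = [0] + list(tau)                  # prepend t = 0 so the last step lands on x_0
    for i in range(len(taus) - 1, 0, -1):
        t, s = taus[i], taus[i - 1]         # current step tau_i and previous step tau_{i-1}
        ab_t, ab_s = alpha_bar[t], alpha_bar[s]
        bb_t, bb_s = np.sqrt(1 - ab_t ** 2), np.sqrt(1 - ab_s ** 2)
        sigma = eta * bb_s / bb_t * np.sqrt(1 - (ab_t / ab_s) ** 2)   # \tilde{\sigma}_{\tau_i}
        eps = eps_model(x, t)
        x = (ab_s / ab_t) * (x - (bb_t - (ab_t / ab_s) * np.sqrt(bb_s ** 2 - sigma ** 2)) * eps)
        if sigma > 0:
            x = x + sigma * rng.standard_normal(shape)
    return x
```

For example, `tau = range(100, 1001, 100)` would run a model trained with $T = 1000$ steps in only 10 generation steps.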

Experimental Results

The original paper ran experiments combining different noise intensities with different numbers of diffusion steps $\dim(\boldsymbol{\tau})$. Roughly speaking, the conclusion is that the smaller the noise, the better the generation quality after acceleration, as shown in the figure below:

[Figure: DDIM experimental results, showing that the smaller the noise, the better the generation quality after acceleration.]

My own reference implementation is as follows:

https://github.com/bojone/ddim

My personal experimental conclusions are:

  1. Contrary to intuition, the smaller the $\sigma_t$ in the generation process, the relatively larger the noise and diversity of the final generated image;
  2. The fewer the diffusion steps $\dim(\boldsymbol{\tau})$, the smoother the generated images, and diversity will also decrease;
  3. Combining points 1 and 2, when the number of diffusion steps $\dim(\boldsymbol{\tau})$ decreases, $\sigma_t$ can be appropriately reduced to maintain roughly constant image quality, which is consistent with the experimental conclusions of the original DDIM paper;
  4. When $\sigma_t$ is small, compared to a trainable Embedding layer, using fixed Sinusoidal encoding to represent $t$ results in less noise in the generated images;
  5. When $\sigma_t$ is small, the U-Net architecture from the original paper (ddpm2.py in the Github repo) results in less noise than my own improvised U-Net architecture (ddpm.py in the Github repo);
  6. However, generally speaking, the generation quality without noise is not as good as the generation quality with noise. Without noise, the performance is heavily influenced by the model architecture.

Furthermore, for DDIM with $\sigma_t=0$, it is a deterministic transformation that transforms any normal noise vector into an image. This is almost identical to GANs. Thus, similar to GANs, we can perform interpolation on noise vectors and observe the corresponding generation effects. Note, however, that DDPM or DDIM are sensitive to the noise distribution, so we cannot use linear interpolation but must use spherical interpolation. This is because, due to the additivity of normal distributions, if $\boldsymbol{z}_1,\boldsymbol{z}_2\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$, then $\lambda\boldsymbol{z}_1 + (1-\lambda)\boldsymbol{z}_2$ generally does not follow $\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$. Instead, we use:

\begin{equation}\boldsymbol{z} = \boldsymbol{z}_1 \cos\frac{\lambda\pi}{2} + \boldsymbol{z}_2 \sin\frac{\lambda\pi}{2},\quad \lambda\in[0, 1]\end{equation}
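A minimal numpy sketch of this interpolation follows; because $\cos^2\frac{\lambda\pi}{2} + \sin^2\frac{\lambda\pi}{2} = 1$, each interpolant is again distributed as $\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$ when $\boldsymbol{z}_1,\boldsymbol{z}_2$ are independent standard normals.

```python
import numpy as np

def slerp_noise(z1, z2, lam):
    """Interpolate z = z1*cos(lam*pi/2) + z2*sin(lam*pi/2); preserves N(0, I) marginally."""
    theta = lam * np.pi / 2
    return np.cos(theta) * z1 + np.sin(theta) * z2

rng = np.random.default_rng(0)
z1, z2 = rng.standard_normal((3, 64, 64)), rng.standard_normal((3, 64, 64))
# Each interpolated noise vector can be fed to the sigma_t = 0 sampler to get an image sequence
frames = [slerp_noise(z1, z2, lam) for lam in np.linspace(0, 1, 8)]
```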

Interpolation effect demonstration (model trained by myself):

[Figure: Interpolation generation results from DDIM noise vectors.]

Differential Equations

Finally, let's focus on analyzing the case where $\sigma_t = 0$. At this point, $\eqref{eq:sigma=0}$ can be equivalently rewritten as:

\begin{equation}\frac{\boldsymbol{x}_t}{\bar{\alpha}_t} - \frac{\boldsymbol{x}_{t-1}}{\bar{\alpha}_{t-1}} = \left(\frac{\bar{\beta}_t}{\bar{\alpha}_t} - \frac{\bar{\beta}_{t-1}}{\bar{\alpha}_{t-1}}\right) \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\end{equation}

When $T$ is large enough, or rather when the difference between $\bar{\alpha}_{t-1}$ and $\bar{\alpha}_t$ is small enough, we can view the above equation as a finite-difference discretization of some ordinary differential equation (ODE). Specifically, introducing a virtual time parameter $s$, we obtain:

\begin{equation}\frac{d}{ds}\left(\frac{\boldsymbol{x}(s)}{\bar{\alpha}(s)}\right) = \boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\boldsymbol{x}(s), t(s)\right)\frac{d}{ds}\left(\frac{\bar{\beta}(s)}{\bar{\alpha}(s)}\right)\label{eq:ode}\end{equation}

Without loss of generality, assume $s\in[0,1]$, where $s=0$ corresponds to $t=0$ and $s=1$ corresponds to $t=T$. Note that the original DDIM paper directly used $\frac{\bar{\beta}(s)}{\bar{\alpha}(s)}$ as the virtual time parameter, which is in principle not ideal: its range is $[0, \infty)$, and an unbounded interval is inconvenient for numerical solution.

So now, what we need to do is solve for $\boldsymbol{x}(0)$ given $\boldsymbol{x}(1)\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})$. The iteration process of DDPM or DDIM corresponds to solving this ODE with the Euler method, which is well known to be the least efficient of the standard solvers. To accelerate the solution, one can instead use methods such as Heun's method or Runge-Kutta (R-K) methods. In other words, casting the generation process as the solution of an ODE gives us a richer and more diverse toolbox of numerical ODE methods for speeding up generation.

Taking DDPM's default parameters $T=1000, \alpha_t = \sqrt{1 - \frac{0.02t}{T}}$ as an example, we repeat the estimate from Part 1:

\begin{equation}\log \bar{\alpha}_t = \sum_{k=1}^t \log\alpha_k = \frac{1}{2} \sum_{k=1}^t \log\left(1 - \frac{0.02k}{T}\right) < \frac{1}{2} \sum_{k=1}^t \left(- \frac{0.02k}{T}\right) = -\frac{0.005t(t+1)}{T}\end{equation}

In fact, since each $\alpha_k$ is very close to 1, the above bound is also a very good approximation. Since our starting point in this article is $p(\boldsymbol{x}_t|\boldsymbol{x}_0)$, we should specify $\bar{\alpha}_t$ directly. Based on the above approximation, we can simply take:

\begin{equation}\bar{\alpha}_t = \exp\left(-\frac{0.005t^2}{T}\right) = \exp\left(-\frac{5t^2}{T^2}\right)\end{equation}
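As a quick numerical check of this closed form against the exact cumulative product under DDPM's default schedule (this little experiment is an illustration, not taken from the papers):

```python
import numpy as np

T = 1000
t = np.arange(1, T + 1)
alpha = np.sqrt(1 - 0.02 * t / T)                 # \alpha_t under DDPM's default schedule
alpha_bar_exact = np.cumprod(alpha)               # \bar{\alpha}_t = \prod_{k<=t} \alpha_k
alpha_bar_approx = np.exp(-5 * t ** 2 / T ** 2)   # the closed form above

print(np.abs(alpha_bar_exact - alpha_bar_approx).max())  # roughly 2e-3, so the approximation is good
```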

If we take $s=t/T$ as the parameter, then exactly $s\in[0,1]$, and $\bar{\alpha}(s)=e^{-5s^2}$. Substituting this into equation $\eqref{eq:ode}$ and simplifying, we get:

\begin{equation}\frac{d\boldsymbol{x}(s)}{ds} = 10s\left(\frac{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\boldsymbol{x}(s), sT\right)}{\sqrt{1-e^{-10s^2}}} - \boldsymbol{x}(s)\right)\end{equation}

We can also take $s=t^2/T^2$ as the parameter. Then we also have $s\in[0,1]$ and $\bar{\alpha}(s)=e^{-5s}$. Substituting this into equation $\eqref{eq:ode}$ and simplifying, we get:

\begin{equation}\frac{d\boldsymbol{x}(s)}{ds} = 5\left(\frac{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\boldsymbol{x}(s), \sqrt{s}T\right)}{\sqrt{1-e^{-10s}}} - \boldsymbol{x}(s)\right)\end{equation}
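As an illustration of replacing the Euler method with a higher-order solver, the sketch below integrates this last form of the ODE with the classical fourth-order Runge-Kutta method. Here `eps_model(x, t)` is again a hypothetical placeholder for the trained $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}$ (in practice its time argument $\sqrt{s}\,T$ may need rounding to a trained step), and since the denominator $\sqrt{1-e^{-10s}}$ vanishes at $s=0$, the integration stops at a small $s_{\min} > 0$:

```python
import numpy as np

def ddim_ode_rk4(eps_model, shape, steps=20, T=1000, s_min=1e-3, rng=np.random.default_rng()):
    """Solve dx/ds = 5*(eps_theta(x, sqrt(s)*T)/sqrt(1 - exp(-10*s)) - x) with classical RK4,
    integrating backwards from s = 1 (pure noise) down to s_min (nearly clean image)."""
    def f(s, x):
        return 5.0 * (eps_model(x, np.sqrt(s) * T) / np.sqrt(1 - np.exp(-10 * s)) - x)

    x = rng.standard_normal(shape)                # x(1) ~ N(0, I)
    s_grid = np.linspace(1.0, s_min, steps + 1)
    for s0, s1 in zip(s_grid[:-1], s_grid[1:]):
        h = s1 - s0                               # negative step: we move backwards in s
        k1 = f(s0, x)
        k2 = f(s0 + h / 2, x + h / 2 * k1)
        k3 = f(s0 + h / 2, x + h / 2 * k2)
        k4 = f(s1, x + h * k3)
        x = x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x
```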

Summary

This article follows the derivation logic of the previous DDPM article to introduce DDIM. It re-examines the starting point of DDPM, eliminates $p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})$ from the derivation process, thereby obtaining a broader class of solutions and ideas for accelerating the generation process. Finally, this new class of solutions allows us to link the generation process with the solution of ordinary differential equations, thus enabling further research into the generation process using ODE methods.