By 苏剑林 | September 21, 2022
In "Generative Diffusion Models (10): Unified Diffusion Model (Theory)", the author claimed to have constructed a Unified Diffusion Model (UDM) framework that accommodates more general diffusion methods and data types. Can the UDM framework actually deliver on this promise? This article demonstrates its generality through several concrete examples.
First, UDM constructs the forward process by choosing a noise distribution $q(\boldsymbol{\varepsilon})$ and a transformation $\boldsymbol{\mathcal{F}}$:
\begin{equation}\boldsymbol{x}_t = \boldsymbol{\mathcal{F}}_t(\boldsymbol{x}_0,\boldsymbol{\varepsilon}),\quad \boldsymbol{\varepsilon}\sim q(\boldsymbol{\varepsilon})\end{equation}Then, the reverse process $\boldsymbol{x}_{t-1}\sim p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$ is sampled via the following decomposition:
\begin{equation}\hat{\boldsymbol{x}}_0\sim p(\boldsymbol{x}_0|\boldsymbol{x}_t)\quad \& \quad \boldsymbol{x}_{t-1}\sim p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0=\hat{\boldsymbol{x}}_0)\end{equation}where $p(\boldsymbol{x}_0|\boldsymbol{x}_t)$ is the distribution for estimating $\boldsymbol{x}_0$ given $\boldsymbol{x}_t$, which is generally approximated by a simple distribution $q(\boldsymbol{x}_0|\boldsymbol{x}_t)$. The training objective is essentially $-\log q(\boldsymbol{x}_0|\boldsymbol{x}_t)$ or a simple variant thereof. When $\boldsymbol{x}_0$ is continuous data, $q(\boldsymbol{x}_0|\boldsymbol{x}_t)$ is typically chosen as a conditional normal distribution; when $\boldsymbol{x}_0$ is discrete data, $q(\boldsymbol{x}_0|\boldsymbol{x}_t)$ can be an autoregressive or non-autoregressive model.
As for the most basic choice for $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$, it is:
\begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0) = p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_0)\quad\Leftrightarrow\quad \boldsymbol{x}_{t-1}=\boldsymbol{\mathcal{F}}_{t-1}(\boldsymbol{x}_0,\boldsymbol{\varepsilon})\end{equation}Starting from this baseline, different optimization results can be obtained under different conditions. If $\boldsymbol{\mathcal{F}}_t(\boldsymbol{x}_0,\boldsymbol{\varepsilon})$ is invertible with respect to $\boldsymbol{\varepsilon}$, we can solve for $\boldsymbol{\varepsilon} = \boldsymbol{\mathcal{F}}_t^{-1}(\boldsymbol{x}_0, \boldsymbol{x}_t)$, and then obtain a better deterministic sampling method:
\begin{equation}\boldsymbol{x}_{t-1} = \boldsymbol{\mathcal{F}}_{t-1}(\boldsymbol{x}_0,\boldsymbol{\mathcal{F}}_t^{-1}(\boldsymbol{x}_0,\boldsymbol{x}_t))\end{equation}Furthermore, if $q(\boldsymbol{\varepsilon})$ is a standard normal distribution, we can obtain:
\begin{equation}\boldsymbol{x}_{t-1} = \boldsymbol{\mathcal{F}}_{t-1}(\boldsymbol{x}_0,\sqrt{1 - \tilde{\sigma}_t^2}\boldsymbol{\mathcal{F}}_t^{-1}(\boldsymbol{x}_0,\boldsymbol{x}_t) + \tilde{\sigma}_t \boldsymbol{\varepsilon})\end{equation}In this section, we prove that "Hot Diffusion Models" are a special case of UDM. Here, "Hot Diffusion" refers to mainstream diffusion models like DDPM and DDIM introduced previously. This term originates from the "Cold Diffusion" paper discussed below.
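As a concrete illustration, the deterministic reverse step above can be sketched in a few lines of code. This is a minimal sketch, not a full implementation: `predict_x0`, `F` and `F_inv` are hypothetical callables standing in for the trained model $p(\boldsymbol{x}_0|\boldsymbol{x}_t)$, the transformation $\boldsymbol{\mathcal{F}}_t$, and its inverse with respect to $\boldsymbol{\varepsilon}$.

```python
import numpy as np

def udm_reverse_step(x_t, t, predict_x0, F, F_inv, sigma_tilde=0.0, rng=None):
    """One reverse step of the UDM framework (schematic sketch).

    predict_x0, F and F_inv are hypothetical callables standing in for
    p(x0|xt), the forward transform F_t, and its inverse in epsilon.
    """
    x0_hat = predict_x0(x_t, t)              # x0_hat ~ p(x0 | xt)
    eps = F_inv(x0_hat, x_t, t)              # eps = F_t^{-1}(x0_hat, x_t)
    if sigma_tilde > 0:                      # optional re-noising, valid when q(eps) = N(0, I)
        rng = rng or np.random.default_rng()
        eps = np.sqrt(1 - sigma_tilde**2) * eps + sigma_tilde * rng.standard_normal(eps.shape)
    return F(x0_hat, eps, t - 1)             # x_{t-1} = F_{t-1}(x0_hat, eps)
```

With $\tilde{\sigma}_t = 0$ this is the fully deterministic rule; $\tilde{\sigma}_t > 0$ gives the re-noised variant for Gaussian $q(\boldsymbol{\varepsilon})$.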
Mainstream diffusion models handle continuous data and construct the forward process using additive Gaussian noise:
\begin{equation}\boldsymbol{x}_t = \bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})\end{equation}The choice for $q(\boldsymbol{x}_0|\boldsymbol{x}_t)$ is the normal distribution $\mathcal{N}(\boldsymbol{x}_0;\bar{\boldsymbol{\mu}}(\boldsymbol{x}_t),\bar{\sigma}_t^2 \boldsymbol{I})$. Generally, $\bar{\sigma}_t$ is not treated as a training parameter, so after omitting the constant terms, we have:
\begin{equation}-\log q(\boldsymbol{x}_0|\boldsymbol{x}_t) = \frac{1}{2\bar{\sigma}_t^2}\Vert\boldsymbol{x}_0 - \bar{\boldsymbol{\mu}}(\boldsymbol{x}_t)\Vert^2\end{equation}Further introducing the parameterization $\bar{\boldsymbol{\mu}}(\boldsymbol{x}_t) = \frac{1}{\bar{\alpha}_t}\left(\boldsymbol{x}_t - \bar{\beta}_t \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right)$ and combining it with $\boldsymbol{x}_t = \bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon}$, we get:
\begin{equation}-\log q(\boldsymbol{x}_0|\boldsymbol{x}_t) = \frac{\bar{\beta}_t^2}{2\bar{\sigma}_t^2\bar{\alpha}_t^2}\left\Vert\boldsymbol{\varepsilon} - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon}, t)\right\Vert^2\end{equation}Experiments show that results are better when the leading coefficient is omitted, so the final training objective is generally $\Vert\boldsymbol{\varepsilon} - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon}, t)\Vert^2$. For the choice of $\bar{\sigma}_t$ during the sampling process, see "Generative Diffusion Models (7): Optimal Diffusion Variance Estimation (Part 1)" and "Generative Diffusion Models (8): Optimal Diffusion Variance Estimation (Part 2)".
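This training objective is simple enough to write down directly. Below is a minimal sketch of one training-loss evaluation; `eps_theta` is a hypothetical stand-in for the noise-prediction network $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}$.

```python
import numpy as np

def ddpm_train_loss(x0, eps_theta, alpha_bar, beta_bar, t, rng):
    """Simplified training objective ||eps - eps_theta(xt, t)||^2.

    eps_theta is a hypothetical stand-in for the noise-prediction network.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = alpha_bar[t] * x0 + beta_bar[t] * eps    # forward: xt = abar_t x0 + bbar_t eps
    return np.sum((eps - eps_theta(x_t, t)) ** 2)  # unweighted epsilon-loss
```

In practice one would average this over random $t$ and over minibatches of $\boldsymbol{x}_0$.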
Finally, regarding $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$, we have:
\begin{equation}\begin{aligned}\boldsymbol{x}_{t-1} =& \bar{\alpha}_{t-1} \boldsymbol{x}_0 + \bar{\beta}_{t-1} \boldsymbol{\varepsilon}\\ \sim& \bar{\alpha}_{t-1} \boldsymbol{x}_0 + \sqrt{\bar{\beta}_{t-1}^2 - \sigma_t^2}\boldsymbol{\varepsilon}_1 + \sigma_t\boldsymbol{\varepsilon}_2\end{aligned}\,\,,\quad \boldsymbol{\varepsilon},\boldsymbol{\varepsilon}_1,\boldsymbol{\varepsilon}_2\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})\end{equation}From $\boldsymbol{x}_t = \bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon}$, we solve for $\boldsymbol{\varepsilon} = \left.(\boldsymbol{x}_t - \bar{\alpha}_t \boldsymbol{x}_0)\right/ \bar{\beta}_t$. Replacing $\boldsymbol{\varepsilon}_1$, we finally obtain the general $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$ as:
\begin{equation}\boldsymbol{x}_{t-1} = \bar{\alpha}_{t-1} \boldsymbol{x}_0 + \sqrt{\bar{\beta}_{t-1}^2 - \sigma_t^2}\frac{\boldsymbol{x}_t - \bar{\alpha}_t \boldsymbol{x}_0}{\bar{\beta}_t} + \sigma_t\boldsymbol{\varepsilon},\quad\boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})\end{equation}And $\hat{\boldsymbol{x}}_0\sim p(\boldsymbol{x}_0|\boldsymbol{x}_t)$ implies:
\begin{equation}\hat{\boldsymbol{x}}_0 = \bar{\boldsymbol{\mu}}(\boldsymbol{x}_t) + \bar{\sigma}_t \boldsymbol{\varepsilon} = \frac{1}{\bar{\alpha}_t}\left(\boldsymbol{x}_t - \bar{\beta}_t \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right) + \bar{\sigma}_t \boldsymbol{\varepsilon}\end{equation}Combining the above two equations yields the reverse process of the most general mainstream diffusion framework. Specifically, DDPM takes $\bar{\sigma}_t=0, \sigma_t = \frac{\bar{\beta}_{t-1}\beta_t}{\bar{\beta}_t}$, DDIM takes $\bar{\sigma}_t=0, \sigma_t = 0$, while Analytic-DPM re-estimates an optimal non-zero $\bar{\sigma}_t$.
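The combined reverse step can be sketched as follows; this is a schematic under the Gaussian forward process, with `eps_theta` a hypothetical noise predictor and `ab`, `bb`, `sigma`, `sigma_bar` the schedules $\bar{\alpha}_t$, $\bar{\beta}_t$, $\sigma_t$, $\bar{\sigma}_t$.

```python
import numpy as np

def gaussian_reverse_step(x_t, t, eps_theta, ab, bb, sigma, sigma_bar, rng):
    """One reverse step of the general Gaussian framework.

    sigma_bar[t] = 0 with sigma[t] = bb[t-1] * beta_t / bb[t] recovers DDPM;
    sigma_bar[t] = sigma[t] = 0 recovers DDIM.
    """
    # x0_hat = (xt - bbar_t * eps_theta(xt, t)) / abar_t + sigmabar_t * eps
    x0_hat = (x_t - bb[t] * eps_theta(x_t, t)) / ab[t] \
             + sigma_bar[t] * rng.standard_normal(x_t.shape)
    # x_{t-1} = abar_{t-1} x0_hat
    #           + sqrt(bbar_{t-1}^2 - sigma_t^2) (xt - abar_t x0_hat) / bbar_t + sigma_t eps
    mean = ab[t - 1] * x0_hat \
           + np.sqrt(bb[t - 1] ** 2 - sigma[t] ** 2) * (x_t - ab[t] * x0_hat) / bb[t]
    return mean + sigma[t] * rng.standard_normal(x_t.shape)
```

Setting all $\sigma_t, \bar{\sigma}_t$ to zero makes the whole trajectory deterministic given $\boldsymbol{x}_T$, which is exactly the DDIM sampler.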
Next, we prove that "Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise" is also a special case of UDM. Cold Diffusion also handles continuous data. As the title suggests, it focuses on using arbitrary (noise-free) transformations to construct the forward process. To the author's knowledge, this is the first paper to attempt a general forward process. UDM received much inspiration from it during its construction, and the author thanks the original researchers.
Cold Diffusion constructs the forward process through a deterministic transformation $\boldsymbol{x}_t = \boldsymbol{\mathcal{F}}_t(\boldsymbol{x}_0)$. For the sake of subsequent analysis, we introduce a more general forward process:
\begin{equation}\boldsymbol{x}_t = \boldsymbol{\mathcal{F}}_t(\boldsymbol{x}_0) + \sigma \boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon}\sim q(\boldsymbol{\varepsilon})\end{equation}The transformation $\boldsymbol{\mathcal{F}}$ can be any form of degradation of the original data; for images, this includes blurring, masking, pooling, etc. If a deterministic transformation is needed, one can simply let $\sigma \to 0$ afterward.
Then, $q(\boldsymbol{x}_0|\boldsymbol{x}_t)$ is chosen as a distribution based on the $l_1$ norm metric, i.e.,
\begin{equation}q(\boldsymbol{x}_0|\boldsymbol{x}_t) = \frac{1}{Z(\tau)} e^{-\left.\Vert\boldsymbol{x}_0 - \boldsymbol{\mathcal{G}}_t(\boldsymbol{x}_t)\Vert_1\right/\tau}\end{equation}where $Z(\tau)$ is the normalization factor (the integral of the numerator over $\boldsymbol{x}_0$). Treating $\tau$ as a fixed value and omitting the constant terms, we have $-\log q(\boldsymbol{x}_0|\boldsymbol{x}_t)\propto\Vert\boldsymbol{x}_0 - \boldsymbol{\mathcal{G}}_t(\boldsymbol{x}_t)\Vert_1$. Combined with $\boldsymbol{x}_t = \boldsymbol{\mathcal{F}}_t(\boldsymbol{x}_0)$, the training objective becomes minimizing:
\begin{equation}\Vert\boldsymbol{x}_0 - \boldsymbol{\mathcal{G}}_t(\boldsymbol{\mathcal{F}}_t(\boldsymbol{x}_0))\Vert_1\end{equation}In the reverse process, Cold Diffusion directly ignores the variance of $q(\boldsymbol{x}_0|\boldsymbol{x}_t)$ (i.e., letting $\tau \to 0$), thus obtaining $\hat{\boldsymbol{x}}_0 = \boldsymbol{\mathcal{G}}_t(\boldsymbol{x}_t)$. If $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)$ uses the baseline choice $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_0)$, i.e., $\boldsymbol{x}_{t-1} = \boldsymbol{\mathcal{F}}_{t-1}(\boldsymbol{x}_0) + \sigma \boldsymbol{\varepsilon}$, then substituting $\hat{\boldsymbol{x}}_0$ and taking the limit $\sigma \to 0$ yields:
\begin{equation}\hat{\boldsymbol{x}}_0=\boldsymbol{\mathcal{G}}_t(\boldsymbol{x}_t),\quad \boldsymbol{x}_{t-1} = \boldsymbol{\mathcal{F}}_{t-1}(\hat{\boldsymbol{x}}_0)\end{equation}This is the "Naive Sampling" from the original paper. If we instead solve $\boldsymbol{\varepsilon} = \left.(\boldsymbol{x}_t - \boldsymbol{\mathcal{F}}_t(\boldsymbol{x}_0))\right/\sigma$ from $\boldsymbol{x}_t = \boldsymbol{\mathcal{F}}_t(\boldsymbol{x}_0) + \sigma \boldsymbol{\varepsilon}$ and substitute it into $\boldsymbol{x}_{t-1} = \boldsymbol{\mathcal{F}}_{t-1}(\boldsymbol{x}_0) + \sigma \boldsymbol{\varepsilon}$, we get:
\begin{equation}\hat{\boldsymbol{x}}_0=\boldsymbol{\mathcal{G}}_t(\boldsymbol{x}_t),\quad \boldsymbol{x}_{t-1} = \boldsymbol{x}_t + \boldsymbol{\mathcal{F}}_{t-1}(\hat{\boldsymbol{x}}_0) - \boldsymbol{\mathcal{F}}_t(\hat{\boldsymbol{x}}_0)\end{equation}This is the "Improved Sampling" from the original paper.
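The two sampling rules differ by a single term, which is easy to see in code. This is a sketch with `G` and `F` as hypothetical callables for the restoration model $\boldsymbol{\mathcal{G}}_t$ and the degradation $\boldsymbol{\mathcal{F}}_t$.

```python
def naive_step(x_t, t, G, F):
    """Naive Sampling: x0_hat = G_t(xt), then x_{t-1} = F_{t-1}(x0_hat)."""
    x0_hat = G(x_t, t)
    return F(x0_hat, t - 1)

def improved_step(x_t, t, G, F):
    """Improved Sampling: x_{t-1} = xt + F_{t-1}(x0_hat) - F_t(x0_hat)."""
    x0_hat = G(x_t, t)
    return x_t + F(x0_hat, t - 1) - F(x0_hat, t)
```

When the restoration model is exact the two agree; the improved rule reuses $\boldsymbol{x}_t$ itself and only applies the increment $\boldsymbol{\mathcal{F}}_{t-1} - \boldsymbol{\mathcal{F}}_t$, which the original paper found more robust to prediction error.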
Overall, Cold Diffusion succeeded in implementing a forward process with general transformations for the first time, but because it emphasizes "Without Noise" too much, it has theoretical flaws that cannot be ignored. For example, for image data of size $w\times w\times 3$, if Cold Diffusion uses a blurring operation to implement the forward process, the fully blurred endpoint is effectively a 3-dimensional vector (essentially one value per channel). Since Cold Diffusion's reverse process is also deterministic, this means it deterministically compresses $3w^2$-dimensional images into 3 dimensions and then reconstructs them back to $3w^2$ dimensions. Significant information loss is inevitable in this intermediate step, which necessarily limits reconstruction clarity and thus generation quality.
To solve this, noise cannot be excluded from either the forward or the reverse process. Noise implies uncertainty, and uncertainty implies "one-to-many"; allowing "one-to-many" in the reverse process in turn permits a "many-to-one" forward process, that is, it permits information loss. In fact, Cold Diffusion itself realized that a 3-dimensional vector struggles to generate complete $3w^2$-dimensional data: during generation, it actually adds slight $3w^2$-dimensional random noise to the 3-dimensional vector, and experiments show that this operation improves generation results. This operation essentially corresponds to a forward process with $\sigma > 0$.
The above two examples both handle continuous data. However, as stated, UDM in principle does not restrict the data type. This section introduces a discrete example, showing that text generation models based on editing operations essentially can also be viewed as special cases of UDM.
To keep it simple, we consider the generation of fixed-length sentences of length $l$, such as five-character or seven-character quatrains. Variable-length sentences are possible but slightly more complex in detail. Then, we define the forward process $\boldsymbol{x}_t = \boldsymbol{\mathcal{F}}_t(\boldsymbol{x}_0,\boldsymbol{\varepsilon})$ as "random substitution":
Randomly select $t$ tokens in the sentence and randomly replace them with other tokens.
where $t \leq l$. When $t=l$, $\boldsymbol{x}_t$ consists of $l$ completely random tokens.
In this case, $q(\boldsymbol{x}_0|\boldsymbol{x}_t)$ is a model that predicts the original sequence from the substituted sequence, which can be implemented using either autoregressive or non-autoregressive models, with cross-entropy as the loss function. Note that $\boldsymbol{\mathcal{F}}_t(\boldsymbol{x}_0,\boldsymbol{\varepsilon})$ is inherently non-invertible with respect to noise (i.e., given $\boldsymbol{x}_0$ and $\boldsymbol{x}_t$, there is more than one way to transform $\boldsymbol{x}_0$ into $\boldsymbol{x}_t$). Therefore, we can only use the baseline choice $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{x}_0)=p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_0)$, which means the generation process is:
1. Randomly select $l$ tokens as the initial $\boldsymbol{x}_l$;
2. Predict $\hat{\boldsymbol{x}}_0$ from $q(\boldsymbol{x}_0|\boldsymbol{x}_t)$;
3. Randomly select $t-1$ tokens from $\hat{\boldsymbol{x}}_0$ and randomly replace them with other tokens to serve as $\boldsymbol{x}_{t-1}$;
4. Repeat steps 2 and 3 until the final $\boldsymbol{x}_0$ is obtained.
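The four steps above can be sketched as a short loop. This is a schematic, with `predict_x0` a hypothetical model mapping a corrupted token list (and the step $t$) to a guess of the original sequence.

```python
import random

def baseline_generate(l, vocab, predict_x0, rng=random):
    """Baseline reverse process for the random-substitution forward process.

    predict_x0 is a hypothetical model standing in for q(x0|xt).
    """
    x = [rng.choice(vocab) for _ in range(l)]     # step 1: x_l is l random tokens
    for t in range(l, 0, -1):
        x0_hat = predict_x0(x, t)                 # step 2: predict x0_hat
        x = list(x0_hat)
        for i in rng.sample(range(l), t - 1):     # step 3: re-corrupt t-1 positions
            x[i] = rng.choice(vocab)
    return x
```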
However, such an algorithm will not perform well, because the prediction from step 2 is often "destroyed" by the random substitution in step 3, giving a sense of "one step forward, two steps back." To do better, a finer sampling scheme must be used, which requires $\boldsymbol{\mathcal{F}}_t(\boldsymbol{x}_0,\boldsymbol{\varepsilon})$ to be invertible with respect to the noise, meaning that given $\boldsymbol{x}_0$ and $\boldsymbol{x}_t$, one can determine exactly how the transformation occurred. To this end, we redefine the forward process as:
Randomly select $t$ tokens in the sentence and replace them with different tokens.
The difference here is that during the random substitution, the original token must be replaced by a token that is different from it. If this restriction isn't made, it's possible to sample the same token. With this restriction, we can directly compare the differences between $\boldsymbol{x}_0$ and $\boldsymbol{x}_t$ to see what was modified, thereby replacing the random substitution in step 3 with a transformation from $\hat{\boldsymbol{x}}_0$ to $\boldsymbol{x}_t$:
1. Randomly select $l$ tokens as the initial $\boldsymbol{x}_l$;
2. Predict $\hat{\boldsymbol{x}}_0$ from $q(\boldsymbol{x}_0|\boldsymbol{x}_t)$, requiring that $\hat{\boldsymbol{x}}_0$ and $\boldsymbol{x}_t$ have $t$ different tokens (better implemented using non-autoregressive models);
3. Randomly select one of the tokens in $\boldsymbol{x}_t$ that is different from $\hat{\boldsymbol{x}}_0$, and replace it with the token at the corresponding position in $\hat{\boldsymbol{x}}_0$, using this as $\boldsymbol{x}_{t-1}$;
4. Repeat steps 2 and 3 until the final $\boldsymbol{x}_0$ is obtained.
In this way, the effective parts of each prediction $\hat{\boldsymbol{x}}_0$ (the parts where $\hat{\boldsymbol{x}}_0$ is identical to $\boldsymbol{x}_t$) are preserved, and since $\boldsymbol{x}_{t-1}$ only modifies one token compared to $\boldsymbol{x}_t$, the generation process is a stable, progressive generation. The difference between this and a standard autoregressive model is the removal of the left-to-right generation direction constraint.
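The improved procedure keeps the agreeing positions and fixes one disagreeing position per step. A minimal sketch follows; `predict_x0` is again a hypothetical model, and a guard handles the case where the prediction already matches $\boldsymbol{x}_t$ everywhere (the text assumes exactly $t$ differing tokens).

```python
import random

def edit_generate(l, vocab, predict_x0, rng=random):
    """Improved reverse process: copy one predicted token per step.

    predict_x0 is a hypothetical model standing in for q(x0|xt); ideally
    its output differs from x_t in exactly the t corrupted positions.
    """
    x = [rng.choice(vocab) for _ in range(l)]          # x_l: l random tokens
    for t in range(l, 0, -1):
        x0_hat = predict_x0(x, t)
        diff = [i for i in range(l) if x[i] != x0_hat[i]]
        if diff:                                       # fix one differing position
            i = rng.choice(diff)
            x[i] = x0_hat[i]
    return x
```

Each step changes at most one token, so the trajectory is the stable, progressive generation described above.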
If the above model still seems abstract, here is a simpler example to aid understanding. Again, consider the generation of a fixed-length sentence of length $l$. We define the forward process $\boldsymbol{x}_t = \boldsymbol{\mathcal{F}}_t(\boldsymbol{x}_0,\boldsymbol{\varepsilon})$ as "random masking":
Randomly select $t$ tokens in the sentence and replace them with [MASK].
where $t \leq l$. When $t=l$, $\boldsymbol{x}_t$ consists of $l$ [MASK] tokens.
Here, $q(\boldsymbol{x}_0|\boldsymbol{x}_t)$ is a model that predicts the original sequence from the sequence with [MASK] tokens, generally implemented using an MLM model (non-autoregressive model) similar to BERT, with cross-entropy as the loss function. The baseline generation process is:
1. Start with $l$ [MASK] tokens as the initial $\boldsymbol{x}_l$;
2. Sample $\hat{\boldsymbol{x}}_0$ from $q(\boldsymbol{x}_0|\boldsymbol{x}_t)$;
3. Randomly select $t-1$ tokens from $\hat{\boldsymbol{x}}_0$ and replace them with [MASK] to serve as $\boldsymbol{x}_{t-1}$;
4. Repeat steps 2 and 3 until the final $\boldsymbol{x}_0$ is obtained.
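The baseline masking loop has the same shape as the substitution loop, with [MASK] playing the role of the random replacement. A sketch, with `predict_x0` a hypothetical MLM-style model that fills every [MASK]:

```python
import random

MASK = "[MASK]"

def mask_generate(l, predict_x0, rng=random):
    """Baseline reverse process for the random-masking forward process.

    predict_x0 is a hypothetical MLM-style stand-in for q(x0|xt).
    """
    x = [MASK] * l                                 # step 1: x_l is all [MASK]
    for t in range(l, 0, -1):
        x0_hat = predict_x0(x, t)                  # step 2: sample x0_hat
        x = list(x0_hat)
        for i in rng.sample(range(l), t - 1):      # step 3: re-mask t-1 positions
            x[i] = MASK
    return x
```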
Note that in this case, $\boldsymbol{\mathcal{F}}_t(\boldsymbol{x}_0,\boldsymbol{\varepsilon})$ is invertible with respect to noise—we can clearly see which tokens were replaced by [MASK] from given $\boldsymbol{x}_0$ and $\boldsymbol{x}_t$. Therefore, we can construct an improved generation process:
1. Start with $l$ [MASK] tokens as the initial $\boldsymbol{x}_l$;
2. Sample $\hat{\boldsymbol{x}}_0$ from $q(\boldsymbol{x}_0|\boldsymbol{x}_t)$, noting that only the tokens that were originally [MASK] need to be sampled; the non-[MASK] tokens are left unchanged;
3. Randomly select $t-1$ positions from the $t$ [MASK] locations in the original $\boldsymbol{x}_t$, replace the tokens at these positions in $\hat{\boldsymbol{x}}_0$ with [MASK], and use this as $\boldsymbol{x}_{t-1}$;
4. Repeat steps 2 and 3 until the final $\boldsymbol{x}_0$ is obtained.
Of course, steps 2 and 3 can be combined more directly:
2 & 3. Randomly select 1 position from the $t$ [MASK] locations in $\boldsymbol{x}_t$, sample a token based on the probability at that position from $q(\boldsymbol{x}_0|\boldsymbol{x}_t)$, replace it, and use this as $\boldsymbol{x}_{t-1}$.
This is almost identical to Gibbs sampling based on an MLM model (refer to "Searched Text (3): Text Sampling Based on BERT"). From the "Editing Model" and "Masking Model" examples, we can see that many "progressive generation" models can be reformulated using the UDM framework. Or conversely, any progressive generation method we can think of can be attempted using the UDM framework to build its probabilistic representation.
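The combined step 2 & 3 amounts to revealing one masked position at a time. A minimal sketch, where `sample_token(x, i)` is a hypothetical stand-in for sampling from the position-$i$ marginal of $q(\boldsymbol{x}_0|\boldsymbol{x}_t)$ (e.g. an MLM head's softmax at that position):

```python
import random

MASK = "[MASK]"

def unmask_generate(l, sample_token, rng=random):
    """Combined steps 2 & 3: reveal one [MASK] position per iteration.

    sample_token(x, i) is a hypothetical stand-in for sampling from the
    position-i marginal of q(x0|xt).
    """
    x = [MASK] * l
    while MASK in x:
        masked = [i for i in range(l) if x[i] == MASK]
        i = rng.choice(masked)          # pick one remaining [MASK] position
        x[i] = sample_token(x, i)       # fill it from the model
    return x
```

Each iteration calls the model on the current partially revealed sequence, which is what makes the procedure Gibbs-like rather than a single parallel fill.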
Previously, the forward processes we discussed were without trainable parameters—they were pre-designed. However, this is not strictly necessary. We can generalize the diffusion process of DDPM as:
\begin{equation}\boldsymbol{x}_t = \bar{\alpha}_t \boldsymbol{\mathcal{F}}(\boldsymbol{x}_0) + \bar{\beta}_t \boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})\end{equation}where $\boldsymbol{\mathcal{F}}(\boldsymbol{x}_0)$ is an encoding model for $\boldsymbol{x}_0$, which can have trainable parameters. In this case, the training objective is:
\begin{equation}-\log q(\boldsymbol{x}_0|\boldsymbol{x}_t) = -\log q(\boldsymbol{x}_0|\bar{\alpha}_t\boldsymbol{\mathcal{F}}(\boldsymbol{x}_0) + \bar{\beta}_t \boldsymbol{\varepsilon})\end{equation}except that $\boldsymbol{\mathcal{F}}$ now also has trainable parameters. The reverse process is similar, except that once $\hat{\boldsymbol{x}}_0\sim q(\boldsymbol{x}_0|\boldsymbol{x}_1)$ is sampled, it is returned directly. In particular, because of the addition of the encoding model $\boldsymbol{\mathcal{F}}$, the input $\boldsymbol{x}_0$ can be either discrete or continuous data. This provides a method similar to VAE for encoding data distributions into a normal distribution of latent variables.
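A minimal sketch of this parameterized forward process follows; `encoder` is the trainable map $\boldsymbol{\mathcal{F}}$, here represented by an arbitrary callable for illustration. Because the encoding happens first, the input may be discrete (e.g. token ids) even though the noise is Gaussian.

```python
import numpy as np

def parametric_forward(x0, t, encoder, ab, bb, rng):
    """Forward process with a trainable encoder: xt = abar_t * F(x0) + bbar_t * eps.

    encoder is a hypothetical trainable map; any callable works here.
    """
    z = np.asarray(encoder(x0))          # F(x0): the (learned) encoding of x0
    eps = rng.standard_normal(z.shape)   # Gaussian noise in the latent space
    return ab[t] * z + bb[t] * eps
```

Training would backpropagate $-\log q(\boldsymbol{x}_0|\boldsymbol{x}_t)$ through both the decoder and `encoder`, which is the VAE-like aspect mentioned above.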
This article primarily uses the Unified Diffusion Model (UDM) framework built in the previous post to derive several concrete examples, including mainstream diffusion models, Cold Diffusion, text editing generation, and encoding models.