By 苏剑林 | August 30, 2022
The previous articles have mostly focused on theoretical results. In this article, we will discuss a topic of significant practical value: conditional controlled generation.
As generative models, the development of diffusion models closely mirrors that of VAEs, GANs, and flow models: unconditional generation appeared first, followed closely by conditional generation. Unconditional generation is often used to explore the upper bounds of performance, while conditional generation is more focused on application, as it allows us to control the output according to our intentions. Since the advent of DDPM, many works on conditional diffusion models have emerged; one could even say it was conditional diffusion models—such as the famous text-to-image models DALL·E 2 and Imagen—that truly made diffusion models popular.
In this article, we will briefly study and summarize the theoretical foundations of conditional diffusion models.
Methodologically, there are two primary ways to achieve conditional generation: post-hoc modification (Classifier-Guidance) and pre-training (Classifier-Free).
For most people, the cost of training a SOTA-level diffusion model is prohibitive, whereas training a classifier is manageable. The idea, then, is to reuse an existing unconditional diffusion model trained by others and use a classifier to steer the generation process toward the desired output; this is the post-hoc Classifier-Guidance scheme. On the other hand, "deep-pocketed" companies like Google and OpenAI, which lack neither data nor computing power, prefer to incorporate the conditional signal directly into the training of the diffusion model to achieve better generation quality; this is the pre-training Classifier-Free scheme.
The Classifier-Guidance scheme originated from "Diffusion Models Beat GANs on Image Synthesis", initially used for class-conditional generation. Later, "More Control for Free! Image Synthesis with Semantic Diffusion Guidance" generalized the concept of the "Classifier," enabling generation based on images or text. The Classifier-Guidance scheme has lower training costs (NLP readers might recall the similar PPLM model), but higher inference costs, and its control over fine details is often not as precise.
As for the Classifier-Free scheme, it first appeared in "Classifier-Free Diffusion Guidance". Subsequent eye-catching models like DALL·E 2 and Imagen are basically built upon it. It is worth mentioning that although the paper was only posted to arXiv last month, it was actually accepted at NeurIPS 2021 last year. It should be said that the Classifier-Free scheme itself has little theoretical novelty; it is the most straightforward approach to conditional diffusion models, and it appeared later simply because of the high cost of retraining diffusion models. Given sufficient data and computing power, the Classifier-Free scheme has demonstrated astonishing detail-control capabilities.
Essentially, the Classifier-Free scheme is expensive to train but "lacks technical complexity" in itself, so most of this article will focus on the Classifier-Guidance scheme, with the Classifier-Free scheme introduced briefly at the end.
As readers of our previous articles know, the most critical step in a generative diffusion model is the construction of the generation process $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$. For generation conditioned on an input $\boldsymbol{y}$, we simply replace $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$ with $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y})$, meaning we add $\boldsymbol{y}$ as an input to the generation process. To reuse a pre-trained unconditional model $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$, we use Bayes' theorem: \begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{y}) = \frac{p(\boldsymbol{x}_{t-1})p(\boldsymbol{y}|\boldsymbol{x}_{t-1})}{p(\boldsymbol{y})}\end{equation} By adding the condition $\boldsymbol{x}_t$ to each term, we get: \begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y}) = \frac{p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)p(\boldsymbol{y}|\boldsymbol{x}_{t-1}, \boldsymbol{x}_t)}{p(\boldsymbol{y}|\boldsymbol{x}_t)}\label{eq:bayes-1}\end{equation} Note that in the forward process, $\boldsymbol{x}_t$ is obtained by adding noise to $\boldsymbol{x}_{t-1}$. Since noise does not help with classification, adding $\boldsymbol{x}_t$ provides no benefit for classification given $\boldsymbol{x}_{t-1}$. Thus, $p(\boldsymbol{y}|\boldsymbol{x}_{t-1}, \boldsymbol{x}_t) = p(\boldsymbol{y}|\boldsymbol{x}_{t-1})$, leading to: \begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y}) = \frac{p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)p(\boldsymbol{y}|\boldsymbol{x}_{t-1})}{p(\boldsymbol{y}|\boldsymbol{x}_t)} = p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t) e^{\log p(\boldsymbol{y}|\boldsymbol{x}_{t-1}) - \log p(\boldsymbol{y}|\boldsymbol{x}_t)}\label{eq:bayes-2}\end{equation}
Readers who have read "Talk on Generative Diffusion Models (5): General Framework SDE Edition" will find the following process familiar. However, even if you haven't, we will provide the full derivation below.
When $T$ is sufficiently large, the variance of $p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1})$ is small enough that the probability is only significantly greater than 0 when $\boldsymbol{x}_t$ is very close to $\boldsymbol{x}_{t-1}$. The converse is also true: $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y})$ or $p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1}, \boldsymbol{y})$ is only significant when $\boldsymbol{x}_t$ and $\boldsymbol{x}_{t-1}$ are very close. We only need to focus on probability changes within this range. Using a Taylor expansion: \begin{equation}\log p(\boldsymbol{y}|\boldsymbol{x}_{t-1}) - \log p(\boldsymbol{y}|\boldsymbol{x}_t)\approx (\boldsymbol{x}_{t-1} - \boldsymbol{x}_t)\cdot\nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t)\end{equation} Strictly speaking, there is also a term regarding the change in $t$, but it does not depend on $\boldsymbol{x}_{t-1}$ and is therefore a constant term that does not affect the probability distribution of $\boldsymbol{x}_{t-1}$, so we omit it. Assuming the original $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t) = \mathcal{N}(\boldsymbol{x}_{t-1}; \boldsymbol{\mu}(\boldsymbol{x}_t), \sigma_t^2\boldsymbol{I}) \propto e^{-\Vert \boldsymbol{x}_{t-1} - \boldsymbol{\mu}(\boldsymbol{x}_t)\Vert^2/2\sigma_t^2}$, we then have the approximation: \begin{equation}\begin{aligned} p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y}) \propto&\, e^{-\Vert \boldsymbol{x}_{t-1} - \boldsymbol{\mu}(\boldsymbol{x}_t)\Vert^2/2\sigma_t^2 + (\boldsymbol{x}_{t-1} - \boldsymbol{x}_t)\cdot\nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t)} \\ \propto&\, e^{-\Vert \boldsymbol{x}_{t-1} - \boldsymbol{\mu}(\boldsymbol{x}_t) - \sigma_t^2 \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t)\Vert^2/2\sigma_t^2} \end{aligned}\end{equation} From this, it can be seen that $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y})$ is approximately $\mathcal{N}(\boldsymbol{x}_{t-1}; \boldsymbol{\mu}(\boldsymbol{x}_t) + \sigma_t^2 \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t), \sigma_t^2\boldsymbol{I})$. Thus, we only need to modify the sampling in the generation process to: \begin{equation}\boldsymbol{x}_{t-1} = \boldsymbol{\mu}(\boldsymbol{x}_t) \color{skyblue}{+} {\color{skyblue}{\underbrace{\sigma_t^2 \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t)}_{\text{Additional Term}}}} + \sigma_t\boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})\end{equation} This is the core result of the Classifier-Guidance scheme. Note that the input to $p(\boldsymbol{y}|\boldsymbol{x}_t)$ is the noisy sample $\boldsymbol{x}_t$, which means we need a classifier that can predict from noisy samples. If we only have a model $p_o(\boldsymbol{y}|\boldsymbol{x})$ trained on noise-free samples, we can approximate $p(\boldsymbol{y}|\boldsymbol{x}_t)$ as: \begin{equation}p(\boldsymbol{y}|\boldsymbol{x}_t) = p_{o}(\boldsymbol{y}|\boldsymbol{\mu}(\boldsymbol{x}_t))\end{equation} That is, we use $\boldsymbol{\mu}(\cdot)$ to denoise $\boldsymbol{x}_t$ before passing it to $p_o$, which avoids the cost of training a classifier on noisy samples.
The original paper ("Diffusion Models Beat GANs on Image Synthesis") found that introducing a scaling parameter $\gamma$ into the classifier gradient gives better control over the generated results: \begin{equation}\boldsymbol{x}_{t-1} = \boldsymbol{\mu}(\boldsymbol{x}_t) \color{skyblue}{+} \color{skyblue}{\sigma_t^2 \color{red}{\gamma}\nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t)} + \sigma_t\boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})\label{eq:gamma-sample}\end{equation} When $\gamma > 1$, the generation process uses more of the classifier's signal, which improves the correlation between the generated result and the input signal $\boldsymbol{y}$ but correspondingly reduces the diversity of the generated results. Conversely, decreasing $\gamma$ reduces the correlation but increases diversity.
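To make the sampling rule concrete, here is a minimal PyTorch sketch of one guided reverse step implementing equation $\eqref{eq:gamma-sample}$. Note that `mu` (the pre-trained denoiser mean) and `classifier` (a noise-aware log-classifier) are hypothetical callables standing in for whatever models you actually have:

```python
import torch

def classifier_guided_step(mu, sigma_t, x_t, y, classifier, gamma=1.0):
    """One guided reverse step x_t -> x_{t-1}.

    mu:         callable x_t -> mean of the unconditional p(x_{t-1}|x_t)
    classifier: callable (x, y) -> log p(y|x); assumed noise-aware, or
                wrapped as p_o(y|mu(x_t)) as described above
    gamma:      guidance scale; gamma=1 recovers the plain Bayes-derived update
    """
    x_t = x_t.detach().requires_grad_(True)
    log_p = classifier(x_t, y).sum()              # scalar so autograd can run
    grad = torch.autograd.grad(log_p, x_t)[0]     # grad of log p(y|x_t) w.r.t. x_t
    mean = mu(x_t).detach() + (sigma_t ** 2) * gamma * grad
    return mean + sigma_t * torch.randn_like(mean)  # add N(0, sigma_t^2 I) noise
```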
How do we understand this parameter theoretically? The original paper suggested viewing it as increasing the focus of the distribution via a power operation, i.e., defining: \begin{equation}\tilde{p}(\boldsymbol{y}|\boldsymbol{x}_t) = \frac{p^{\gamma}(\boldsymbol{y}|\boldsymbol{x}_t)}{Z(\boldsymbol{x}_t)},\quad Z(\boldsymbol{x}_t)=\sum_{\boldsymbol{y}} p^{\gamma}(\boldsymbol{y}|\boldsymbol{x}_t)\end{equation} As $\gamma$ increases, the prediction $\tilde{p}(\boldsymbol{y}|\boldsymbol{x}_t)$ becomes closer to a one-hot distribution. Using this instead of $p(\boldsymbol{y}|\boldsymbol{x}_t)$ for Classifier-Guidance makes the generation process tend to pick samples with higher classification confidence.
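A quick numerical illustration (plain Python/NumPy, with made-up probabilities) of how the power operation sharpens a distribution toward one-hot:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # some classifier output p(y|x_t)
for gamma in (1.0, 3.0, 10.0):
    q = p ** gamma
    q /= q.sum()                # tilde-p = p^gamma / Z
    print(gamma, q.round(3))
# 1.0  [0.5   0.3   0.2  ]
# 3.0  [0.781 0.169 0.05 ]
# 10.0 [0.994 0.006 0.   ]
```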
However, while this perspective has some reference value, it isn't entirely correct because: \begin{equation}\nabla_{\boldsymbol{x}_t}\log \tilde{p}(\boldsymbol{y}|\boldsymbol{x}_t) = \gamma\nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t) - \nabla_{\boldsymbol{x}_t} \log Z(\boldsymbol{x}_t) \neq \gamma\nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t)\end{equation} The original paper mistakenly assumed that $Z(\boldsymbol{x}_t)$ is a constant, so $\nabla_{\boldsymbol{x}_t} \log Z(\boldsymbol{x}_t)=0$. But in fact, when $\gamma\neq 1$, $Z(\boldsymbol{x}_t)$ explicitly depends on $\boldsymbol{x}_t$. The author also gave some thought to whether there is any remedy for this, but unfortunately, there were no results; we can only weakly assume that the gradient properties at $\gamma=1$ (where $Z(\boldsymbol{x}_t)=1$) can be approximately generalized to the case where $\gamma\neq 1$.
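To see why, differentiate $\log Z(\boldsymbol{x}_t)$ directly; the neglected term is a $\tilde{p}$-weighted average of the classifier gradients:
\begin{equation}\nabla_{\boldsymbol{x}_t}\log Z(\boldsymbol{x}_t) = \frac{\sum_{\boldsymbol{y}}\gamma\, p^{\gamma-1}(\boldsymbol{y}|\boldsymbol{x}_t)\nabla_{\boldsymbol{x}_t}p(\boldsymbol{y}|\boldsymbol{x}_t)}{Z(\boldsymbol{x}_t)} = \gamma\sum_{\boldsymbol{y}}\tilde{p}(\boldsymbol{y}|\boldsymbol{x}_t)\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{y}|\boldsymbol{x}_t)\end{equation}
Only at $\gamma=1$ does this reduce to $\nabla_{\boldsymbol{x}_t}\sum_{\boldsymbol{y}}p(\boldsymbol{y}|\boldsymbol{x}_t)=\nabla_{\boldsymbol{x}_t}1=\boldsymbol{0}$; for $\gamma\neq 1$ there is no reason for it to vanish.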
In fact, the best way to understand $\gamma\neq 1$ is to give up on understanding $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y})$ through Bayes' theorem as in equations $\eqref{eq:bayes-1}$ and $\eqref{eq:bayes-2}$, and instead directly define: \begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y}) = \frac{p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t) e^{\gamma\cdot\text{sim}(\boldsymbol{x}_{t-1}, \boldsymbol{y})}}{Z(\boldsymbol{x}_t, \boldsymbol{y})},\quad Z(\boldsymbol{x}_t,\boldsymbol{y})=\sum_{\boldsymbol{x}_{t-1}} p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t) e^{\gamma\cdot\text{sim}(\boldsymbol{x}_{t-1}, \boldsymbol{y})}\end{equation} where $\text{sim}(\boldsymbol{x}_{t-1}, \boldsymbol{y})$ is some measure of similarity or correlation between the generation result $\boldsymbol{x}_{t-1}$ and the condition $\boldsymbol{y}$. Under this perspective, $\gamma$ is directly integrated into the definition of $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y})$, controlling the correlation between result and condition. When $\gamma$ is larger, the model tends to generate $\boldsymbol{x}_{t-1}$ that is more relevant to $\boldsymbol{y}$.
To further obtain a sampleable approximation, we can expand at $\boldsymbol{x}_{t-1}=\boldsymbol{x}_t$ (or at $\boldsymbol{x}_{t-1}=\boldsymbol{\mu}(\boldsymbol{x}_t)$, as before): \begin{equation}e^{\gamma\cdot\text{sim}(\boldsymbol{x}_{t-1}, \boldsymbol{y})}\approx e^{\gamma\cdot\text{sim}(\boldsymbol{x}_t, \boldsymbol{y}) + \gamma\cdot(\boldsymbol{x}_{t-1}-\boldsymbol{x}_t)\cdot\nabla_{\boldsymbol{x}_t}\text{sim}(\boldsymbol{x}_t, \boldsymbol{y})} \end{equation} Assuming this approximation is sufficient, and removing terms independent of $\boldsymbol{x}_{t-1}$, we get: \begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y})\propto p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)e^{\gamma\cdot(\boldsymbol{x}_{t-1}-\boldsymbol{x}_t)\cdot\nabla_{\boldsymbol{x}_t}\text{sim}(\boldsymbol{x}_t, \boldsymbol{y})} \end{equation} As before, substituting $p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)=\mathcal{N}(\boldsymbol{x}_{t-1}; \boldsymbol{\mu}(\boldsymbol{x}_t), \sigma_t^2\boldsymbol{I})$ and completing the square results in: \begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y})\approx \mathcal{N}(\boldsymbol{x}_{t-1}; \boldsymbol{\mu}(\boldsymbol{x}_t) + \sigma_t^2\gamma \nabla_{\boldsymbol{x}_t} \text{sim}(\boldsymbol{x}_t, \boldsymbol{y}), \sigma_t^2\boldsymbol{I}) \end{equation}
This way, we don't have to worry about the probabilistic meaning of $p(\boldsymbol{y}|\boldsymbol{x}_t)$, but only need to directly define the metric function $\text{sim}(\boldsymbol{x}_t, \boldsymbol{y})$. Here, $\boldsymbol{y}$ is no longer limited to "categories"; it can be any input signal such as text or images. Usually, these are processed by their respective encoders into feature vectors, and cosine similarity is used: \begin{equation}\text{sim}(\boldsymbol{x}_t, \boldsymbol{y}) = \frac{E_1(\boldsymbol{x}_t)\cdot E_2(\boldsymbol{y})}{\Vert E_1(\boldsymbol{x}_t)\Vert \Vert E_2(\boldsymbol{y})\Vert}\end{equation} It should be pointed out that $\boldsymbol{x}_t$ in the intermediate process is noisy, so the encoder $E_1$ generally should not be the one trained on clean data; it's better to fine-tune it with noisy data. Furthermore, for style transfer, Gram matrix distance is typically used instead of cosine similarity, depending on the scenario. These are the series of results from the paper "More Control for Free! Image Synthesis with Semantic Diffusion Guidance". For more details, please refer to the original paper.
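As a sketch (assuming PyTorch, a hypothetical image encoder `E1` fine-tuned on noisy samples, and a precomputed condition feature `y_feat` $= E_2(\boldsymbol{y})$), the mean correction can be obtained by back-propagating through the similarity:

```python
import torch
import torch.nn.functional as F

def sim_guided_mean(mu, sigma_t, x_t, y_feat, E1, gamma):
    """mu(x_t) + sigma_t^2 * gamma * grad_{x_t} sim(x_t, y),
    with sim taken as the cosine similarity of encoder features."""
    x_t = x_t.detach().requires_grad_(True)
    sim = F.cosine_similarity(E1(x_t), y_feat, dim=-1).sum()
    grad = torch.autograd.grad(sim, x_t)[0]    # grad of sim w.r.t. x_t
    return mu(x_t).detach() + (sigma_t ** 2) * gamma * grad
```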
After the previous derivations, we found that the correction term for the mean is $\sigma_t^2 \gamma \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t)$ or $\sigma_t^2\gamma \nabla_{\boldsymbol{x}_t} \text{sim}(\boldsymbol{x}_t, \boldsymbol{y})$. They share a common trait: when $\sigma_t=0$, the correction term also equals 0, and the control fails.
So, can $\sigma_t$ in the generation process be equal to 0? Certainly. For example, the DDIM introduced in "Talk on Generative Diffusion Models (4): DDIM = High-level perspective DDPM" is a generation process with zero variance. In this case, how do we perform controlled generation? Here, we need the general SDE-based results introduced in "Talk on Generative Diffusion Models (6): General Framework ODE Edition". In that article, we showed that for a forward SDE: \begin{equation}d\boldsymbol{x} = \boldsymbol{f}_t(\boldsymbol{x}) dt + g_t d\boldsymbol{w}\end{equation} the corresponding most general reverse SDE is: \begin{equation}d\boldsymbol{x} = \left(\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 + \sigma_t^2)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right) dt + \sigma_t d\boldsymbol{w}\end{equation} This allows us to freely choose the reverse variance $\sigma_t^2$; DDPM and DDIM can both be seen as special cases, with $\sigma_t=0$ corresponding to a generalized DDIM. As can be seen, the only part of the reverse SDE that depends on the data distribution is $\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})$. For conditional generation, we naturally replace it with $\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}|\boldsymbol{y})$. Using Bayes' theorem: \begin{equation}\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}|\boldsymbol{y}) = \nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}) + \nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{y}|\boldsymbol{x})\end{equation} Under the common parameterization, $\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}) = -\frac{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)}{\bar{\beta}_t}$, therefore: \begin{equation}\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}|\boldsymbol{y}) = -\frac{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)}{\bar{\beta}_t} + \nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{y}|\boldsymbol{x}) = -\frac{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \bar{\beta}_t\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{y}|\boldsymbol{x})}{\bar{\beta}_t}\end{equation} This means that regardless of the generation variance, we only need to replace $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)$ with $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \bar{\beta}_t\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{y}|\boldsymbol{x})$ to achieve conditional controlled generation. Thus, under the unified perspective of SDEs, we obtain the most general form of the Classifier-Guidance scheme very simply and directly.
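In code, this is a one-line correction to the noise prediction. A minimal sketch (with a hypothetical noise-aware log-classifier `log_p_fn` and schedule coefficient `beta_bar_t` $= \bar{\beta}_t$) that works for DDPM, DDIM, or any intermediate variance choice:

```python
import torch

def guided_eps(eps_model, x_t, t, y, log_p_fn, beta_bar_t):
    """eps_theta(x_t, t) - beta_bar_t * grad_{x_t} log p_t(y|x_t)."""
    x_t = x_t.detach().requires_grad_(True)
    grad = torch.autograd.grad(log_p_fn(x_t, y, t).sum(), x_t)[0]
    return eps_model(x_t, t).detach() - beta_bar_t * grad
```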
Finally, we briefly introduce the Classifier-Free scheme. It's actually very simple: it directly defines: \begin{equation}p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y}) = \mathcal{N}(\boldsymbol{x}_{t-1}; \boldsymbol{\mu}(\boldsymbol{x}_t, \boldsymbol{y}), \sigma_t^2\boldsymbol{I}) \end{equation} Following the results from the previous DDPM articles, $\boldsymbol{\mu}(\boldsymbol{x}_t, \boldsymbol{y})$ is generally parameterized as: \begin{equation}\boldsymbol{\mu}(\boldsymbol{x}_t, \boldsymbol{y}) = \frac{1}{\alpha_t}\left(\boldsymbol{x}_t - \frac{\beta_t^2}{\bar{\beta}_t}\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, \boldsymbol{y}, t)\right)\end{equation} The loss function for training is: \begin{equation}\mathbb{E}_{\boldsymbol{x}_0,\boldsymbol{y}\sim\tilde{p}(\boldsymbol{x}_0,\boldsymbol{y}), \boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\left\Vert\boldsymbol{\varepsilon} - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon}, \boldsymbol{y}, t)\right\Vert^2\right]\end{equation} Its advantage is that the additional input $\boldsymbol{y}$ is introduced during the training process; theoretically, more input information makes training easier. Its disadvantage is also that $\boldsymbol{y}$ is introduced during training, which means every time you want to introduce a new type of signal control, you must retrain the entire diffusion model.
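A minimal sketch of the training step (assuming a conditional noise predictor `eps_model(x_t, y, t)` and per-step schedule coefficients broadcastable to the batch shape):

```python
import torch

def classifier_free_loss(eps_model, x0, y, t, alpha_bar_t, beta_bar_t):
    """|| eps - eps_theta(alpha_bar_t * x0 + beta_bar_t * eps, y, t) ||^2,
    averaged over the batch."""
    eps = torch.randn_like(x0)                    # eps ~ N(0, I)
    x_t = alpha_bar_t * x0 + beta_bar_t * eps     # forward noising of x0
    err = eps - eps_model(x_t, y, t)
    return err.flatten(1).pow(2).sum(dim=1).mean()
```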
Notably, the Classifier-Free scheme also borrows the scaling mechanism of the $\gamma$ parameter from Classifier-Guidance to balance correlation and diversity. Specifically, the mean in equation $\eqref{eq:gamma-sample}$ can be rewritten as: \begin{equation}\boldsymbol{\mu}(\boldsymbol{x}_t) + \sigma_t^2 \gamma \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t) = \gamma\left[\boldsymbol{\mu}(\boldsymbol{x}_t) + \sigma_t^2 \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t)\right] - (\gamma - 1) \boldsymbol{\mu}(\boldsymbol{x}_t)\end{equation} The Classifier-Free scheme essentially fits $\boldsymbol{\mu}(\boldsymbol{x}_t) + \sigma_t^2 \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t)$ directly with the model. By analogy with the equation above, we can also introduce a parameter $w = \gamma - 1$ in the Classifier-Free scheme and use: \begin{equation}\tilde{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, \boldsymbol{y}, t) = (1 + w)\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, \boldsymbol{y}, t) - w \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\end{equation} in place of $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, \boldsymbol{y}, t)$ during generation. Where does the unconditional $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)$ come from? We can introduce a special null input $\boldsymbol{\phi}$ whose target outputs are the full set of images (i.e., it carries no conditional information), and include it in the model training; we can then take $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) = \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, \boldsymbol{\phi}, t)$.
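At sampling time, the guided noise estimate is then just a weighted combination of the conditional and null-conditioned predictions; a sketch (randomly replacing $\boldsymbol{y}$ by $\boldsymbol{\phi}$ during training follows the recipe of the Classifier-Free paper):

```python
def cfg_eps(eps_model, x_t, t, y, phi, w):
    """(1 + w) * eps(x_t, y, t) - w * eps(x_t, phi, t).
    phi is the special null input; during training, y is replaced by phi
    for a random fraction of samples so that eps_model(x_t, phi, t)
    learns the unconditional prediction."""
    eps_cond = eps_model(x_t, y, t)
    eps_uncond = eps_model(x_t, phi, t)
    return (1 + w) * eps_cond - w * eps_uncond
```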
This article briefly introduced the theoretical results for building conditional diffusion models, mainly including post-hoc modification (Classifier-Guidance) and pre-training (Classifier-Free) schemes. The former does not require retraining the diffusion model and can achieve simple control at low cost; the latter requires retraining the diffusion model, which is more expensive but allows for more refined control.