Diffusion Model Musings (18): Score Matching = Conditional Score Matching

By 苏剑林 | February 28, 2023

In previous posts, we have repeatedly mentioned "Score Matching" and "Conditional Score Matching," concepts that frequently appear in diffusion models, energy-based models, and the like. In particular, many articles state outright that the training objective of diffusion models is "Score Matching," when in fact the training objective of current mainstream diffusion models such as DDPM is "Conditional Score Matching."

So, what is the specific relationship between "Score Matching" and "Conditional Score Matching"? Are they equivalent? This article discusses this question in detail.

Score Matching

First, Score Matching refers to the training objective:

\begin{equation}\mathbb{E}_{\boldsymbol{x}_t\sim p_t(\boldsymbol{x}_t)}\left[\left\Vert\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t) - \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\right\Vert^2\right]\label{eq:sm}\end{equation}

where $\boldsymbol{\theta}$ represents the trainable parameters. Obviously, Score Matching aims to learn a model $\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)$ to approximate $\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t)$. We call $\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t)$ the "score."

In the context of diffusion models, $p_t(\boldsymbol{x}_t)$ is given by:

\begin{equation}p_t(\boldsymbol{x}_t) = \int p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)p_0(\boldsymbol{x}_0)d\boldsymbol{x}_0 = \mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0)}\left[p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]\label{eq:pt}\end{equation}

where $p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)$ is generally a simple distribution with a known analytical density (such as a conditional normal distribution), while $p_0(\boldsymbol{x}_0)$ is the data distribution: we can sample from it (the samples are the training data), but we do not know its analytical expression.
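For example, taking a conditional normal kernel $p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)=\mathcal{N}(\boldsymbol{x}_t;\alpha_t\boldsymbol{x}_0,\sigma_t^2\boldsymbol{I})$ (the DDPM-style choice; the schedule coefficients $\alpha_t,\sigma_t$ are illustrative notation here), the conditional score is available in closed form:

\begin{equation}\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = -\frac{\boldsymbol{x}_t-\alpha_t\boldsymbol{x}_0}{\sigma_t^2}\end{equation}

whereas the marginal $p_t(\boldsymbol{x}_t)$, and hence its score, has no closed form, precisely because $p_0(\boldsymbol{x}_0)$ is unknown.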

According to equation \eqref{eq:pt}, we can derive:

\begin{equation}\begin{aligned} \nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t) =&\, \frac{\nabla_{\boldsymbol{x}_t}p_t(\boldsymbol{x}_t)}{p_t(\boldsymbol{x}_t)} \\ =&\, \frac{\int p_0(\boldsymbol{x}_0)\nabla_{\boldsymbol{x}_t} p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)d\boldsymbol{x}_0}{p_t(\boldsymbol{x}_t)} \\ =&\, \frac{\int p_0(\boldsymbol{x}_0)\nabla_{\boldsymbol{x}_t} p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)d\boldsymbol{x}_0}{\int p_0(\boldsymbol{x}_0) p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)d\boldsymbol{x}_0} \\ =&\, \frac{\mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0)}\left[\nabla_{\boldsymbol{x}_t}p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]}{\mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0)}\left[p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]} \\ \end{aligned}\label{eq:score-1}\end{equation}

Based on our assumptions, both $\nabla_{\boldsymbol{x}_t}p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)$ and $p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)$ have known analytical expressions, so in theory we can estimate $\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t)$ by sampling $\boldsymbol{x}_0$. However, since this involves the ratio of two expectations, the estimate is biased (refer to "A Brief Description of Unbiased and Biased Estimation"), and many samples are needed for reasonable accuracy. Thus, using equation \eqref{eq:sm} directly as a training objective requires a large batch size to achieve good results.
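To make the bias concrete, here is a minimal numpy sketch of the direct estimator in equation \eqref{eq:score-1}. All specifics (a one-dimensional kernel $p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)=\mathcal{N}(\boldsymbol{x}_t;\boldsymbol{x}_0,\sigma_t^2)$ and $p_0=\mathcal{N}(2,1)$) are toy assumptions for illustration:

```python
import numpy as np

def direct_score_estimate(x_t, x0_samples, sigma_t):
    """Estimate grad_{x_t} log p_t(x_t) via equation (score-1):
    a ratio of two Monte Carlo averages over x_0 ~ p_0."""
    # p_t(x_t|x_0) = N(x_t; x_0, sigma_t^2); normalizing constants cancel in the ratio
    w = np.exp(-(x_t - x0_samples) ** 2 / (2 * sigma_t ** 2))
    # grad_{x_t} p_t(x_t|x_0) = p_t(x_t|x_0) * (x_0 - x_t) / sigma_t^2
    grad_w = w * (x0_samples - x_t) / sigma_t ** 2
    # Ratio of two sample averages: biased at any finite sample size
    return grad_w.mean() / w.mean()

rng = np.random.default_rng(0)
x0 = rng.normal(loc=2.0, scale=1.0, size=10_000)  # "training data" from p_0
print(direct_score_estimate(x_t=0.5, x0_samples=x0, sigma_t=0.8))
```

In this toy setup $p_t=\mathcal{N}(2,1+\sigma_t^2)$, so the true score at $x_t=0.5$ is $1.5/1.64\approx 0.915$; the ratio of averages only approaches it as the number of samples grows.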

Conditional Score

In practice, the training objective used by typical diffusion models is "Conditional Score Matching":

\begin{equation}\mathbb{E}_{\boldsymbol{x}_0,\boldsymbol{x}_t\sim p_0(\boldsymbol{x}_0)p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)}\left[\left\Vert\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) - \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\right\Vert^2\right]\end{equation}

By assumption, $\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)$ has a known analytical expression, so this objective can be used directly by sampling pairs $(\boldsymbol{x}_0, \boldsymbol{x}_t)$. Notably, the resulting minibatch loss is an unbiased estimate of the objective, so it does not particularly rely on a large batch size, making it the more practical training objective.
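As a sketch of how this objective is implemented in practice, assuming the simple kernel $p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)=\mathcal{N}(\boldsymbol{x}_t;\boldsymbol{x}_0,\sigma_t^2\boldsymbol{I})$ (the function and variable names below are hypothetical, not from the original):

```python
import torch

def conditional_score_matching_loss(s_theta, x0, t, sigma_t):
    """One minibatch estimate of the Conditional Score Matching objective,
    assuming p_t(x_t|x_0) = N(x_t; x_0, sigma_t^2 I)."""
    noise = torch.randn_like(x0)
    x_t = x0 + sigma_t * noise   # sample x_t ~ p_t(x_t|x_0)
    target = -noise / sigma_t    # = grad_{x_t} log p_t(x_t|x_0) = -(x_t - x0)/sigma_t^2
    return ((s_theta(x_t, t) - target) ** 2).sum(dim=-1).mean()

# Usage with a toy score network s_theta(x_t, t)
net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.SiLU(),
                          torch.nn.Linear(64, 1))
s_theta = lambda x, t: net(torch.cat([x, t.expand_as(x)], dim=-1))
loss = conditional_score_matching_loss(s_theta, torch.randn(128, 1),
                                       t=torch.tensor(0.5), sigma_t=0.8)
loss.backward()  # each minibatch gradient is an unbiased estimate
```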

To analyze the relationship between "Score Matching" and "Conditional Score Matching," we also need another identity for $\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t)$:

\begin{equation}\begin{aligned} \nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t) =&\, \frac{\nabla_{\boldsymbol{x}_t}p_t(\boldsymbol{x}_t)}{p_t(\boldsymbol{x}_t)} \\ =&\, \frac{\int p_0(\boldsymbol{x}_0)\nabla_{\boldsymbol{x}_t} p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)d\boldsymbol{x}_0}{p_t(\boldsymbol{x}_t)} \\ =&\, \frac{\int p_0(\boldsymbol{x}_0)p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\nabla_{\boldsymbol{x}_t} \log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) d\boldsymbol{x}_0}{p_t(\boldsymbol{x}_t)} \\ =&\, \int p_t(\boldsymbol{x}_0|\boldsymbol{x}_t)\nabla_{\boldsymbol{x}_t} \log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) d\boldsymbol{x}_0 \\ =&\, \mathbb{E}_{\boldsymbol{x}_0\sim p_t(\boldsymbol{x}_0|\boldsymbol{x}_t)}\left[\nabla_{\boldsymbol{x}_t} \log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\right] \\ \end{aligned}\label{eq:score-2}\end{equation}
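Identity \eqref{eq:score-2} is easy to sanity-check numerically. Below is a minimal sketch, assuming $p_0$ places equal mass on the two points $\pm 1$ and a Gaussian kernel (all toy choices):

```python
import numpy as np

sigma = 0.7
centers = np.array([-1.0, 1.0])  # p_0: equal mass on two points
x = 0.3                          # the point x_t at which to compare both sides

def log_p_t(x):
    # mixture density up to a constant (constants drop out of the gradient)
    return np.log(np.mean(np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))))

eps = 1e-5
lhs = (log_p_t(x + eps) - log_p_t(x - eps)) / (2 * eps)  # grad log p_t(x_t)

w = np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))
w /= w.sum()                                  # posterior p_t(x_0|x_t)
rhs = np.sum(w * (centers - x) / sigma ** 2)  # E[grad log p_t(x_t|x_0)]
print(lhs, rhs)  # the two agree up to finite-difference error
```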

Inequality Relation

First, we can quickly prove one direction of the relationship: Conditional Score Matching is an upper bound of Score Matching, so minimizing Conditional Score Matching also, to some extent, minimizes Score Matching.

The proof is not difficult; in fact, we already gave it in "Diffusion Model Musings (16): W-distance ≤ Score Matching":

\begin{equation}\begin{aligned} &\,\mathbb{E}_{\boldsymbol{x}_t\sim p_t(\boldsymbol{x}_t)}\left[\left\Vert\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t) - \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\right\Vert^2\right] \\ =&\,\mathbb{E}_{\boldsymbol{x}_t\sim p_t(\boldsymbol{x}_t)}\left[\left\Vert\mathbb{E}_{\boldsymbol{x}_0\sim p_t(\boldsymbol{x}_0|\boldsymbol{x}_t)}\left[\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\right] - \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\right\Vert^2\right] \\ \leq &\,\mathbb{E}_{\boldsymbol{x}_t\sim p_t(\boldsymbol{x}_t)}\mathbb{E}_{\boldsymbol{x}_0\sim p_t(\boldsymbol{x}_0|\boldsymbol{x}_t)}\left[\left\Vert\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) - \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\right\Vert^2\right] \\ = &\,\mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0),\boldsymbol{x}_t\sim p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)}\left[\left\Vert\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) - \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\right\Vert^2\right] \\ \end{aligned}\end{equation}

The first equality uses identity \eqref{eq:score-2}; the inequality follows from Jensen's inequality applied to the (convex) squared norm, i.e., the vector form of the mean square inequality; and the final equality follows from Bayes' theorem, $p_t(\boldsymbol{x}_t)p_t(\boldsymbol{x}_0|\boldsymbol{x}_t) = p_0(\boldsymbol{x}_0)p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)$.
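For completeness, the Jensen step here is the standard identity that, for any random vector $\boldsymbol{X}$,

\begin{equation}\left\Vert\mathbb{E}[\boldsymbol{X}]\right\Vert^2 \leq \mathbb{E}\left[\left\Vert\boldsymbol{X}\right\Vert^2\right],\qquad \mathbb{E}\left[\left\Vert\boldsymbol{X}\right\Vert^2\right] - \left\Vert\mathbb{E}[\boldsymbol{X}]\right\Vert^2 = \mathbb{E}\left[\left\Vert\boldsymbol{X}-\mathbb{E}[\boldsymbol{X}]\right\Vert^2\right]\end{equation}

applied with $\boldsymbol{X}=\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)-\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)$ under $\boldsymbol{x}_0\sim p_t(\boldsymbol{x}_0|\boldsymbol{x}_t)$. Since $\boldsymbol{s}_{\boldsymbol{\theta}}$ does not depend on $\boldsymbol{x}_0$, it cancels in the gap term, which is a first hint that the gap does not depend on $\boldsymbol{\theta}$.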

Equivalence Relation

A few days ago, during a discussion about Score Matching in a WeChat group, a fellow member pointed out that the difference between Conditional Score Matching and Score Matching is a constant independent of the optimization, so the two are in fact completely equivalent! I was quite surprised when I first heard this conclusion: the two satisfy an equivalence relation, not merely an upper-bound relation. What's more, when I sat down to prove it, the proof turned out to be quite simple!

First, regarding Score Matching, we have:

\begin{equation}\begin{aligned} &\,\mathbb{E}_{\boldsymbol{x}_t\sim p_t(\boldsymbol{x}_t)}\left[\left\Vert\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t) - \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\right\Vert^2\right] \\ =&\,\mathbb{E}_{\boldsymbol{x}_t\sim p_t(\boldsymbol{x}_t)}\left[\color{orange}{\left\Vert\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t)\right\Vert^2} + \color{red}{\left\Vert\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\right\Vert^2} - \color{green}{2\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\cdot\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t)}\right] \\ \end{aligned}\end{equation}

Then, regarding Conditional Score Matching, we have:

\begin{equation}\begin{aligned} &\,\mathbb{E}_{\boldsymbol{x}_0,\boldsymbol{x}_t\sim p_0(\boldsymbol{x}_0)p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)}\left[\left\Vert\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) - \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\right\Vert^2\right] \\[5pt] =&\,\mathbb{E}_{\boldsymbol{x}_0,\boldsymbol{x}_t\sim p_0(\boldsymbol{x}_0)p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)}\left[\left\Vert\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2 + \left\Vert\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\right\Vert^2 - 2\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\cdot\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\right] \\[5pt] =&\,\mathbb{E}_{\boldsymbol{x}_t\sim p_t(\boldsymbol{x}_t),\boldsymbol{x}_0\sim p_t(\boldsymbol{x}_0|\boldsymbol{x}_t)}\left[\left\Vert\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2 + \left\Vert\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\right\Vert^2 - 2\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\cdot\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\right] \\[5pt] =&\,\mathbb{E}_{\boldsymbol{x}_t\sim p_t(\boldsymbol{x}_t)}\left[{\begin{aligned}&\color{orange}{\mathbb{E}_{\boldsymbol{x}_0\sim p_t(\boldsymbol{x}_0|\boldsymbol{x}_t)}\left[\left\Vert\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2\right]} + \color{red}{\left\Vert\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\right\Vert^2} \\ &\qquad\qquad- \color{green}{2\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\cdot\mathbb{E}_{\boldsymbol{x}_0\sim p_t(\boldsymbol{x}_0|\boldsymbol{x}_t)}\left[\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]}\end{aligned}}\right] \\[5pt] =&\,\mathbb{E}_{\boldsymbol{x}_t\sim p_t(\boldsymbol{x}_t)}\left[\color{orange}{\mathbb{E}_{\boldsymbol{x}_0\sim p_t(\boldsymbol{x}_0|\boldsymbol{x}_t)}\left[\left\Vert\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2\right]} + \color{red}{\left\Vert\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\right\Vert^2} - \color{green}{2\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\cdot\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t)}\right] \\ \end{aligned}\end{equation}

Subtracting the Score Matching objective from the Conditional Score Matching objective (the red and green terms cancel), we are left with:

\begin{equation}\mathbb{E}_{\boldsymbol{x}_t\sim p_t(\boldsymbol{x}_t)}\left[\color{orange}{\mathbb{E}_{\boldsymbol{x}_0\sim p_t(\boldsymbol{x}_0|\boldsymbol{x}_t)}\left[\left\Vert\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2\right] - \left\Vert\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t)\right\Vert^2}\right]\end{equation}

This is independent of the parameters $\boldsymbol{\theta}$, so minimizing the Score Matching objective and minimizing the Conditional Score Matching objective are theoretically equivalent. Note that this constant is exactly the Jensen gap from the inequality in the previous section, and hence non-negative, consistent with the upper-bound relation. According to the group members, this result first appeared in the paper "A Connection Between Score Matching and Denoising Autoencoders".
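The equivalence can also be verified numerically. The sketch below uses toy assumptions throughout (a two-point $p_0$, a Gaussian kernel, and a hypothetical linear score model $s_{\theta}(x)=\theta x$), computes both objectives by quadrature, and shows the gap is the same for every $\theta$:

```python
import numpy as np

sigma = 0.7
centers = np.array([-1.0, 1.0])    # p_0: equal mass on two points
xs = np.linspace(-6.0, 6.0, 4001)  # quadrature grid for x_t
dx = xs[1] - xs[0]

def gauss(x, mu):  # density of N(mu, sigma^2)
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

p_t = 0.5 * (gauss(xs, centers[0]) + gauss(xs, centers[1]))
score = 0.5 * sum(gauss(xs, c) * (c - xs) / sigma ** 2 for c in centers) / p_t

for theta in [0.0, 0.5, -1.3]:
    s = theta * xs                            # linear score model
    sm = np.sum(p_t * (score - s) ** 2) * dx  # Score Matching objective
    csm = sum(0.5 * np.sum(gauss(xs, c) * ((c - xs) / sigma ** 2 - s) ** 2) * dx
              for c in centers)               # Conditional Score Matching
    print(theta, csm - sm)                    # same constant gap for every theta
```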

Since the two are theoretically equivalent, does this mean our earlier claim, that "Score Matching" requires a larger batch size than "Conditional Score Matching," was wrong? Not exactly. If we estimate $\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t)$ directly via equation \eqref{eq:score-1} and then perform Score Matching, the estimate is indeed biased and depends on a large batch size. But when we expand objective \eqref{eq:sm} and simplify it into the Conditional Score Matching form, the biased estimator is gradually transformed into an unbiased one, which no longer relies heavily on the batch size. In other words, although the two objectives are equal as exact expectations, their Monte Carlo estimators are different statistics; the equivalence is exact only in the limit of infinitely many samples.
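Finally, the batch-size dependence itself can be made visible. Reusing the toy setup from the earlier sketch ($p_0=\mathcal{N}(2,1)$, Gaussian kernel), averaging the ratio estimator of equation \eqref{eq:score-1} over many small batches exposes a systematic bias that shrinks only as the batch grows:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, x_t = 0.8, 0.5
true_score = -(x_t - 2.0) / (1.0 + sigma ** 2)  # p_t = N(2, 1 + sigma^2)

for n in [4, 16, 64, 256]:                      # "batch size" of x_0 samples
    estimates = []
    for _ in range(10_000):
        x0 = rng.normal(2.0, 1.0, size=n)
        w = np.exp(-(x_t - x0) ** 2 / (2 * sigma ** 2))
        estimates.append(np.mean(w * (x0 - x_t) / sigma ** 2) / np.mean(w))
    print(n, np.mean(estimates) - true_score)   # bias shrinks as n grows
```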

Summary

This article analyzed the relationship between the two training objectives "Score Matching" and "Conditional Score Matching": the latter is an upper bound of the former, and in fact the two differ only by a constant independent of the model parameters, so as optimization objectives they are theoretically equivalent. In practice, however, Conditional Score Matching admits an unbiased minibatch estimator, which is why it is the objective actually used.