Generative Diffusion Models (14): General Steps for Constructing ODEs (Part 1)

By 苏剑林 | December 15, 2022

Continuing from where we left off, in "Generative Diffusion Models (13): From Universal Gravitation to Diffusion Models", we introduced an ODE-style generative diffusion model inspired by universal gravitation with a very clear geometric meaning. Some readers wondered: it seems "universal gravitation" is not the only choice; could other forms of force be used to build diffusion models from the same physical picture? Furthermore, while the model is physically intuitive, it still lacks a mathematical proof that it can indeed learn the data distribution.

This article attempts to answer the question "What kind of force field is suitable for constructing an ODE-style generative diffusion model?" from a more precise mathematical perspective.

Basic Conclusions

To answer this question, we need to use a conclusion regarding the distribution changes corresponding to ordinary differential equations, which we derived in "Generative Diffusion Models (12): Tackling Diffusion ODEs Head-on".

Consider a system of first-order (ordinary) differential equations for $\boldsymbol{x}_t \in \mathbb{R}^d, t \in [0, T]$: \begin{equation}\frac{d\boldsymbol{x}_t}{dt}=\boldsymbol{f}_t(\boldsymbol{x}_t)\label{eq:ode}\end{equation} It describes an (invertible) transformation from $\boldsymbol{x}_0$ to $\boldsymbol{x}_T$. If $\boldsymbol{x}_0$ is a random variable, then $\boldsymbol{x}_t$ throughout the process is also a random variable. The law of its distribution change can be described by the following equation: \begin{equation}\frac{\partial}{\partial t} p_t(\boldsymbol{x}_t) = - \nabla_{\boldsymbol{x}_t}\cdot\Big(\boldsymbol{f}_t(\boldsymbol{x}_t) p_t(\boldsymbol{x}_t)\Big)\label{eq:ode-f-eq-fp}\end{equation} This result can be derived using the "Jacobian determinant + Taylor approximation" method as shown in "Generative Diffusion Models (12)", or by first deriving the full "Fokker-Planck equation" as in "Generative Diffusion Models (6): ODE Chapter within the General Framework" and setting $g_t=0$. Incidentally, equation $\eqref{eq:ode-f-eq-fp}$ is very famous in physics; it is called the "continuity equation" and is an embodiment of various conservation laws.
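As a quick sanity check (my own illustration, not part of the original derivation), the continuity equation can be verified symbolically for a simple one-dimensional flow: take $f_t(x) = x$, whose flow map is $x_t = x_0 e^t$, so a density $p_0$ pushes forward to $p_t(x) = p_0(x e^{-t})\,e^{-t}$ by the change-of-variables formula.

```python
import sympy as sp

x, t = sp.symbols('x t', real=True)

# Flow dx/dt = f_t(x) with f_t(x) = x, i.e. x_t = x_0 * exp(t).
f = x

# Push-forward of a standard Gaussian p_0 under this flow:
# p_t(x) = p_0(x * exp(-t)) * exp(-t)   (change of variables).
p0 = lambda y: sp.exp(-y**2 / 2) / sp.sqrt(2 * sp.pi)
p_t = p0(x * sp.exp(-t)) * sp.exp(-t)

# Continuity equation: dp/dt + d(f * p)/dx should vanish identically.
residual = sp.simplify(sp.diff(p_t, t) + sp.diff(f * p_t, x))
print(residual)  # 0
```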

Returning to diffusion models, what they aim to do is construct a transformation that can convert samples from a simple distribution into samples from a target distribution. Using equation $\eqref{eq:ode-f-eq-fp}$, we can theoretically solve for a feasible $\boldsymbol{f}_t(\boldsymbol{x}_t)$ given a specified $p_t(\boldsymbol{x}_t)$, and then complete the generation process using equation $\eqref{eq:ode}$. Note that equation $\eqref{eq:ode-f-eq-fp}$ is just one equation, but $\boldsymbol{f}_t(\boldsymbol{x}_t)$ to be solved has $d$ components, so this is an indeterminate equation. In principle, we can arbitrarily specify the complete $p_t(\boldsymbol{x}_t)$ (and not just the boundaries at $t=0, T$) to solve for $\boldsymbol{f}_t(\boldsymbol{x}_t)$.

So, theoretically, constructing an ODE-style diffusion model amounts to solving a very loosely constrained indeterminate equation. This is true, but solutions obtained this way may be impractical, that is, difficult to implement in code. The more accurate phrasing of the problem is therefore: how do we solve equation $\eqref{eq:ode-f-eq-fp}$ for more practical solutions?

Simplifying the Equation

Notice that equation $\eqref{eq:ode-f-eq-fp}$ can be rewritten as: \begin{equation}\underbrace{\left(\frac{\partial}{\partial t}, \nabla_{\boldsymbol{x}_t}\right)}_{\nabla_{(t,\, \boldsymbol{x}_t)}}\cdot \underbrace{\Big(p_t( \boldsymbol{x}_t), \boldsymbol{f}_t(\boldsymbol{x}_t) p_t(\boldsymbol{x}_t)\Big)}_{\boldsymbol{u}\in\mathbb{R}^{d+1}}=0\end{equation} As shown above, $\left(\frac{\partial}{\partial t}, \nabla_{\boldsymbol{x}_t}\right)$ can be treated exactly as a $(d+1)$-dimensional gradient $\nabla_{(t,\, \boldsymbol{x}_t)}$, and $\big(p_t( \boldsymbol{x}_t), \boldsymbol{f}_t(\boldsymbol{x}_t) p_t(\boldsymbol{x}_t)\big)$ happens to form a $(d+1)$-dimensional vector $\boldsymbol{u}(t, \boldsymbol{x}_t)$. Thus, $\eqref{eq:ode-f-eq-fp}$ can be written as a simple divergence equation: \begin{equation}\nabla_{(t,\, \boldsymbol{x}_t)}\cdot\boldsymbol{u}(t, \boldsymbol{x}_t)=0\label{eq:div-eq}\end{equation} In this form, we have: \begin{equation}\frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{f}_t(\boldsymbol{x}_t) = \frac{\boldsymbol{u}_{> 1}(t, \boldsymbol{x}_t)}{\boldsymbol{u}_1(t, \boldsymbol{x}_t)}\label{eq:div-eq-ode}\end{equation} where $\boldsymbol{u}_1$ and $\boldsymbol{u}_{> 1}$ represent the first component and the subsequent $d$ components of $\boldsymbol{u}$, respectively. Of course, one must not forget the constraints: \begin{equation}\left\{\begin{aligned} &\boldsymbol{u}_1(0, \boldsymbol{x}_0) = p_0(\boldsymbol{x}_0)\quad&(\text{Initial condition}) \\[5pt] &\int \boldsymbol{u}_1(t, \boldsymbol{x}_t) d\boldsymbol{x}_t = 1\quad&(\text{Integral condition}) \end{aligned}\right.\end{equation} where $p_0(\boldsymbol{x}_0)$ is the data distribution, i.e., the target sample distribution to be generated. Regarding the terminal distribution at $t=T$, our requirement is only that it be as simple as possible to facilitate sampling; since there is no other quantitative requirement, we do not need to write it out for now.

Green's Function

After this transformation, we can view $\boldsymbol{u}(t, \boldsymbol{x}_t)$ as a $(d+1)$-dimensional vector field, and the differential equation $\eqref{eq:div-eq-ode}$ exactly describes the trajectory of a particle moving along the field lines. This matches the physical picture given in "Generative Diffusion Models (13): From Universal Gravitation to Diffusion Models".

To find the general solution for $\boldsymbol{u}(t, \boldsymbol{x}_t)$, we can use the idea of Green's functions. First, we attempt to solve the following problem: \begin{equation}\left\{\begin{aligned} &\nabla_{(t,\, \boldsymbol{x}_t)}\cdot\boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)=0\\ &\boldsymbol{G}_1(0, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) = \delta(\boldsymbol{x}_t - \boldsymbol{x}_0),\int \boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) d\boldsymbol{x}_t = 1 \end{aligned}\right.\label{eq:div-green}\end{equation} It is easy to prove that if the above holds, then: \begin{equation}\boldsymbol{u}(t, \boldsymbol{x}_t) = \int \boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)p_0(\boldsymbol{x}_0) d\boldsymbol{x}_0 = \mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0)}[\boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)]\label{eq:div-green-int}\end{equation} will be a solution to equation $\eqref{eq:div-eq}$ satisfying the corresponding constraints. In this way, we represent $\boldsymbol{u}(t, \boldsymbol{x}_t)$ as an expectation form over training samples, which is beneficial for model training. It is not hard to see that $\boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)$ here is actually the conditional probability $p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)$ in diffusion models.

In fact, $\boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)$ as defined in equation $\eqref{eq:div-green}$ is not a Green's function in the usual sense. General Green's functions refer to solutions under a point source, whereas here the "point source" of the Green's function is placed at the boundary. Nevertheless, the defined $\boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)$ still possesses properties similar to conventional Green's functions; it itself corresponds to the "force field" generated by a point source, and equation $\eqref{eq:div-green-int}$ is precisely the integration of fields from point sources to find the field of a continuous source distribution.

Universal Gravitation

Now we use the above framework to solve for some specific results. As mentioned, equations $\eqref{eq:div-eq}$ or $\eqref{eq:div-green}$ are indeterminate equations with "$d+1$ unknowns and one equation." Theoretically, there are infinitely many diverse solutions. To solve it, we instead need to introduce some additional assumptions to make the solution more definite. The first solution is based on the isotropy assumption, which corresponds exactly to the result in "Generative Diffusion Models (13)".

Assumption and Solution

Note that "isotropy" here refers to isotropy in the $(d+1)$-dimensional space composed of $(t, \boldsymbol{x}_t)$. This means $\boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)$ points towards the source point $(0, \boldsymbol{x}_0)$, and its magnitude depends only on $R = \sqrt{(t-0)^2 + \Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert^2}$. Therefore, we can set: \begin{equation}\boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) = \varphi(R)(t, \boldsymbol{x}_t - \boldsymbol{x}_0)\end{equation} Then: \begin{equation}\begin{aligned} 0 =&\, \nabla_{(t,\, \boldsymbol{x}_t)}\cdot\boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) \\ =&\, \nabla_{(t,\, \boldsymbol{x}_t)}\varphi(R)\cdot(t, \boldsymbol{x}_t - \boldsymbol{x}_0) + \varphi(R)\nabla_{(t,\, \boldsymbol{x}_t)}\cdot (t, \boldsymbol{x}_t - \boldsymbol{x}_0) \\ =&\, \varphi'(R) \frac{(t, \boldsymbol{x}_t - \boldsymbol{x}_0)}{R}\cdot(t, \boldsymbol{x}_t - \boldsymbol{x}_0) + (d+1)\varphi(R)\\ =&\, \varphi'(R) R + (d+1)\varphi(R) \\ =&\,\frac{[\varphi(R)R^{d+1}]'}{R^d} \end{aligned}\end{equation} That is $[\varphi(R)R^{d+1}]'=0$, or $\varphi(R)R^{d+1}=C$, i.e., $\varphi(R)=C\times R^{-(d+1)}$. Thus, a candidate solution is: \begin{equation}\boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) = C\times\frac{(t, \boldsymbol{x}_t - \boldsymbol{x}_0)}{\left(t^2 + \Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert^2\right)^{(d+1)/2}}\end{equation}
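For the smallest nontrivial case $d=1$ (a two-dimensional $(t,x)$ space-time), it is easy to confirm symbolically that this candidate solution is divergence-free away from the source point. A SymPy sketch of my own:

```python
import sympy as sp

t, x, x0 = sp.symbols('t x x0', real=True)
d = 1  # spatial dimension; space-time is (d+1)-dimensional

R2 = t**2 + (x - x0)**2
# Candidate solution G = C * (t, x - x0) / R^(d+1), taking C = 1.
G1 = t / R2**sp.Rational(d + 1, 2)         # time component
G2 = (x - x0) / R2**sp.Rational(d + 1, 2)  # spatial component

# Divergence in (t, x): should be identically zero away from (0, x0).
div = sp.simplify(sp.diff(G1, t) + sp.diff(G2, x))
print(div)  # 0
```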

Constraint Check

As can be seen, under the isotropy assumption, the universal gravitation solution is the unique solution. To prove it is a feasible solution, we must check the constraints. The key one is: \begin{equation}\int\boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) d\boldsymbol{x}_t = C\times \int\frac{t}{\left(t^2 + \Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert^2\right)^{(d+1)/2}}d\boldsymbol{x}_t\end{equation} In fact, we only need to verify that this integral is independent of $t$ and $\boldsymbol{x}_0$; we can then choose an appropriate constant $C$ to make it equal to 1. For $t > 0$, this can be checked via the variable substitution $\boldsymbol{z} = (\boldsymbol{x}_t - \boldsymbol{x}_0) / t$. Since $\boldsymbol{x}_t$ covers the entire space, $\boldsymbol{z}$ also covers the entire space. Substituting this into the above gives: \begin{equation}\int\boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) d\boldsymbol{x}_t = C\times \int\frac{1}{\left(1 + \Vert \boldsymbol{z}\Vert^2\right)^{(d+1)/2}}d\boldsymbol{z}\label{eq:pz}\end{equation} The integral is now manifestly independent of $t$ and $\boldsymbol{x}_0$, so with an appropriate choice of $C$ the integral constraint is satisfied. Henceforth, we assume $C$ has been chosen to make the integral 1.
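For $d=1$ the integral can even be evaluated in closed form: $\int_{-\infty}^{\infty} t/(t^2+s^2)\,ds = \pi$ for any $t>0$, so $C=1/\pi$, and $p_t(x_t|x_0)$ is a Cauchy distribution with location $x_0$ and scale $t$. A SymPy check of my own:

```python
import sympy as sp

t = sp.Symbol('t', positive=True)
s = sp.Symbol('s', real=True)  # s = x_t - x_0 (translation invariance in x_0)

# d = 1: integrate the time component G_1 = C * t/(t^2 + s^2) over s.
I = sp.integrate(t / (t**2 + s**2), (s, -sp.oo, sp.oo))
print(I)  # pi: independent of t (and of x_0), so C = 1/pi when d = 1
```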

As for the initial value, we need to verify $\lim\limits_{t\to 0^+}\boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) = \delta(\boldsymbol{x}_t - \boldsymbol{x}_0)$. This only requires checking against the definition of the Dirac delta function:

1. When $\boldsymbol{x}_t\neq \boldsymbol{x}_0$, the limit is clearly 0;
2. When $\boldsymbol{x}_t = \boldsymbol{x}_0$, the limit is clearly $\infty$;
3. We just verified that the integral of $\boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)$ over $\boldsymbol{x}_t$ is constantly 1.

These three points are exactly the basic properties of the Dirac delta function, or one could even say they define it. Therefore, the initial value check also passes.

Analysis of Results

Now, according to equation $\eqref{eq:div-green-int}$, we have: \begin{equation}\boldsymbol{u}(t, \boldsymbol{x}_t) = C\times\mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0)}\left[\frac{(t, \boldsymbol{x}_t - \boldsymbol{x}_0)}{\left(t^2 + \Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert^2\right)^{(d+1)/2}}\right]\end{equation} Next, one simply constructs a score-matching-like objective for learning using $\mathbb{E}_{\boldsymbol{x}}[\boldsymbol{x}] = \mathop{\text{argmin}}_{\boldsymbol{\mu}}\mathbb{E}_{\boldsymbol{x}}\left[\Vert \boldsymbol{x} - \boldsymbol{\mu}\Vert^2\right]$. This process has been discussed many times and won't be repeated here.
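To make this concrete, here is a minimal numerical sketch for $d=1$ (the helper name `field_f` is mine, not from any paper): it evaluates $\boldsymbol{u}$ by averaging $\boldsymbol{G}$ over a finite dataset and returns $\boldsymbol{f}_t = \boldsymbol{u}_{>1}/\boldsymbol{u}_1$; in a real model, this expectation is what a network would be trained to approximate via the squared loss above.

```python
import numpy as np

def field_f(t, x, data, d=1):
    """Evaluate f_t(x) = u_{>1}/u_1 for the d = 1 gravitational model,
    averaging G over the data samples x0 (the constant C cancels)."""
    r = x - data                           # displacements to each x0
    w = (t**2 + r**2) ** (-(d + 1) / 2)    # 1/R^(d+1) weights
    u1 = np.mean(t * w)                    # time component of u
    u2 = np.mean(r * w)                    # spatial component of u
    return u2 / u1

# Sanity check: with a single data point x0, the expectation collapses
# and f_t(x) = (x - x0)/t exactly, so the field lines are straight
# lines converging to (0, x0).
data = np.array([0.0])
print(field_f(0.5, 2.0, data))  # 4.0
```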

As mentioned earlier, $\boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)$ is actually $p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)$. We now know its specific form is: \begin{equation}p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\propto \frac{t}{\left(t^2 + \Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert^2\right)^{(d+1)/2}}\end{equation} When $t=T$ is large enough, the influence of $\boldsymbol{x}_0$ becomes negligible, i.e., $p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)$ degenerates into a prior distribution independent of $\boldsymbol{x}_0$: \begin{equation}p_{prior}(\boldsymbol{x}_T) \propto \frac{T}{(T^2 + \Vert\boldsymbol{x}_T\Vert^2)^{(d+1)/2}}\end{equation} Previously, in "Generative Diffusion Models (13)", deriving this result was quite tedious, but in this framework, it follows naturally. Not only that, now that we have $p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)$, we can theoretically perform the sampling $\boldsymbol{x}_t\sim p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)$. From the derivation of equation $\eqref{eq:pz}$, we know that if we use the substitution $\boldsymbol{z} = (\boldsymbol{x}_t - \boldsymbol{x}_0) / t$, then: \begin{equation}p(\boldsymbol{z}) \propto \frac{1}{\left(1 + \Vert \boldsymbol{z}\Vert^2\right)^{(d+1)/2}}\label{eq:pz-2}\end{equation} Thus, we can first sample from $p(\boldsymbol{z})$ and then obtain the corresponding $\boldsymbol{x}_t$ via $\boldsymbol{x}_t = \boldsymbol{x}_0 + t\, \boldsymbol{z}$. As for sampling from $p(\boldsymbol{z})$, its density depends only on the magnitude $\Vert\boldsymbol{z}\Vert$, so we can use the inverse cumulative distribution function method to sample the magnitude and then sample a direction uniformly at random; the prior distribution can be sampled in exactly the same way. However, while digging into the remaining issues below, I stumbled upon a pleasant surprise!
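Before moving on, the inverse-CDF route is fully explicit for $d=1$: there $p(\boldsymbol{z})\propto 1/(1+z^2)$ is the standard Cauchy distribution, whose quantile function is $\tan\big(\pi(u-\tfrac{1}{2})\big)$, so sampling $\boldsymbol{x}_t$ is a one-liner. A sketch of my own, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_xt(x0, t, n):
    """Sample x_t ~ p_t(x_t|x0) for d = 1, where p(z) ~ 1/(1+z^2)
    is a standard Cauchy distribution (inverse-CDF method)."""
    u = rng.uniform(size=n)
    z = np.tan(np.pi * (u - 0.5))  # Cauchy quantile function
    return x0 + t * z

xs = sample_xt(x0=1.0, t=0.5, n=200_000)
# Median of |x_t - x0|/t should be close to 1 (median of a half-Cauchy),
# and the median of x_t should be close to x0 = 1.
print(np.median(np.abs(xs - 1.0) / 0.5), np.median(xs))
```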

Revisiting the Problem

In "Generative Diffusion Models (13)", we noted that the sampling scheme given in the original paper was: \begin{equation}\boldsymbol{x}_t = \boldsymbol{x}_0 + \Vert \boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert (1+\tau)^m \boldsymbol{u},\quad t = |\varepsilon_t| (1+\tau)^m\end{equation} where $(\boldsymbol{\varepsilon}_{\boldsymbol{x}},\varepsilon_t)\sim\mathcal{N}(\boldsymbol{0}, \sigma^2\boldsymbol{I}_{(d+1)\times(d+1)})$, $m\sim U[0,M]$, $\boldsymbol{u}$ is a unit vector uniformly distributed on the $d$-dimensional unit sphere, and $\tau,\sigma,M$ are constants. At the time, the comment on this sampling was that it had "considerable subjectivity," meaning it felt like it was subjectively designed by the authors without much justification. However, whether the authors intended it or not, I found a miraculous "coincidence": this sampling is exactly an implementation of equation $\eqref{eq:pz-2}$!

Let's prove this. First, we substitute the second half of the above equation into the first half to get: \begin{equation}\boldsymbol{x}_t = \boldsymbol{x}_0 + t\times \frac{\Vert \boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert}{|\varepsilon_t|} \boldsymbol{u}\end{equation} Formally, it is already identical to $\boldsymbol{x}_t = \boldsymbol{x}_0 + t\, \boldsymbol{z}$ mentioned in the previous section, and $\boldsymbol{u}$ is also an isotropic unit random vector. So the question becomes whether $\frac{\Vert \boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert}{|\varepsilon_t|}$ has the same distribution as $\Vert\boldsymbol{z}\Vert$. The answer is yes! Note that when the probability density changes from Cartesian to spherical coordinates, it must be multiplied by $\text{radius}^{d-1}$. So, according to equation $\eqref{eq:pz-2}$, we have: \begin{equation}p(\Vert\boldsymbol{z}\Vert) \propto \frac{\Vert \boldsymbol{z}\Vert^{d-1}}{\left(1 + \Vert \boldsymbol{z}\Vert^2\right)^{(d+1)/2}}\label{eq:pz-3}\end{equation} Now, given $(\boldsymbol{\varepsilon}_{\boldsymbol{x}},\varepsilon_t)\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_{(d+1)\times(d+1)})$ (since we are looking at a ratio, the variance cancels out, so we take $\sigma=1$ for simplicity): \begin{equation}p(\Vert\boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert) \propto \Vert\boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert^{d-1} e^{-\Vert\boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert^2/2}, \quad p(|\varepsilon_t|) \propto e^{-|\varepsilon_t|^2/2}\end{equation} Let $r = \frac{\Vert \boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert}{|\varepsilon_t|}$, then $\Vert \boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert=r|\varepsilon_t|$. 
Then, based on the equality of probabilities: \begin{equation}\begin{aligned} p(r)dr =&\, \mathbb{E}_{|\varepsilon_t|\sim p(|\varepsilon_t|)}\big[p(\Vert \boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert\color{red}{=r|\varepsilon_t|)}d(\color{red}{r|\varepsilon_t|})\big] \\[5pt] \propto&\, \mathbb{E}_{|\varepsilon_t|\sim p(|\varepsilon_t|)}\big[r^{d-1}|\varepsilon_t|^d e^{-r^2|\varepsilon_t|^2/2} dr\big] \\[5pt] \propto&\, \int_0^{\infty} r^{d-1}|\varepsilon_t|^d e^{-r^2|\varepsilon_t|^2/2} e^{-|\varepsilon_t|^2/2} d|\varepsilon_t| dr \\ =&\, \int_0^{\infty} r^{d-1}|\varepsilon_t|^d e^{-(r^2+1)|\varepsilon_t|^2/2} d|\varepsilon_t| dr \\ =&\, \frac{r^{d-1}}{(1+r^2)^{(d+1)/2}} \int_0^{\infty} s^d e^{-s^2/2} ds dr \quad\left(\text{let }s = |\varepsilon_t|\sqrt{r^2+1}\right) \\ \propto&\, \frac{r^{d-1}}{(1+r^2)^{(d+1)/2}} dr \end{aligned}\end{equation} Therefore $p(r)\propto \frac{r^{d-1}}{(1+r^2)^{(d+1)/2}}$, which is perfectly consistent with $\eqref{eq:pz-3}$. Thus, $\frac{\Vert \boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert}{|\varepsilon_t|}\boldsymbol{u}$ indeed provides an effective sampling method for $\boldsymbol{z}$, which is much easier to implement than the inverse cumulative distribution function method, though the original paper did not mention this.
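This equivalence is also easy to confirm numerically. For $d=1$, equation $\eqref{eq:pz-3}$ reduces to a half-Cauchy density $p(r)\propto 1/(1+r^2)$ on $r>0$, whose median is $\tan(\pi/4)=1$; the Gaussian-ratio construction should reproduce this. A quick check of my own:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1, 200_000

# Paper's construction: r = ||eps_x|| / |eps_t| with (eps_x, eps_t)
# standard Gaussian in d+1 dimensions (sigma cancels in the ratio).
eps_x = rng.standard_normal((n, d))
eps_t = rng.standard_normal(n)
r = np.linalg.norm(eps_x, axis=1) / np.abs(eps_t)

# Target: p(r) ~ r^(d-1)/(1+r^2)^((d+1)/2); for d = 1 this is a
# half-Cauchy distribution, whose median is tan(pi/4) = 1.
print(np.median(r))
```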

Space-Time Separation

We just solved for the isotropic solution in the $(d+1)$-dimensional space composed of $(t, \boldsymbol{x}_t)$. In a sense, this can be considered the simplest solution. This statement might be hard for some readers to accept, as the universal gravitation diffusion model looks significantly more complex mathematically. However, in solving mathematical physics equations, isotropic solutions are indeed often tested first as the simplest cases.

Of course, viewing the $(t, \boldsymbol{x}_t)$ "space-time" as a whole for isotropy is not as intuitive in our understanding. We are more accustomed to understanding isotropy in the spatial dimensions while keeping the time dimension separate. This section solves under this assumption.

Assumption and Solution

That is to say, "isotropy" here refers to isotropy within the $d$-dimensional space of $\boldsymbol{x}_t$. $\boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)$ is decomposed into two parts: $(\boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0), \boldsymbol{G}_{> 1}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0))$. Here $\boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)$ is a scalar; isotropy means it only depends on $r = \Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert$, which we denote as $\phi_t(r)$. $\boldsymbol{G}_{> 1}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)$ is a $d$-dimensional vector; isotropy means $\boldsymbol{G}_{> 1}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)$ points towards the source $\boldsymbol{x}_0$ and its magnitude depends only on $r = \Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert$. Thus, we can set: \begin{equation}\boldsymbol{G}_{>1}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) = \varphi_t(r)(\boldsymbol{x}_t - \boldsymbol{x}_0)\end{equation} Then: \begin{equation}\begin{aligned} 0 =&\, \frac{\partial}{\partial t}\phi_t(r) + \nabla_{\boldsymbol{x}_t}\cdot(\varphi_t(r) (\boldsymbol{x}_t - \boldsymbol{x}_0)) \\ =&\, \frac{\partial}{\partial t}\phi_t(r) + r\frac{\partial}{\partial r}\varphi_t(r) + d\, \varphi_t(r) \\ =&\, \frac{\partial}{\partial t}\phi_t(r) + \frac{1}{r^{d-1}}\frac{\partial}{\partial r}\big(\varphi_t(r) r^d\big)\\ \end{aligned}\end{equation} There are two unknown functions $\phi_t(r), \varphi_t(r)$, but only one equation, which makes solving even easier. Since the constraints apply to $\boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)$ (which is $\phi_t(r)$) rather than $\varphi_t(r)$, the natural approach is to specify a $\phi_t(r)$ that satisfies the conditions and solve for $\varphi_t(r)$. The result is: \begin{equation}\varphi_t(r) = -\frac{1}{r^d}\int \frac{\partial}{\partial t}\phi_t(r) r^{d-1} dr = -\frac{1}{r^d}\frac{\partial}{\partial t}\int \phi_t(r) r^{d-1} dr\label{eq:f-g-t-r}\end{equation}

Gaussian Diffusion

This part demonstrates that the common ODE diffusion model based on Gaussian distribution assumptions is also a special case of equation $\eqref{eq:f-g-t-r}$. For the Gaussian assumption, we have: \begin{equation}\boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) = p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = \frac{1}{(2\pi\sigma_t^2)^{d/2}} e^{-\Vert\boldsymbol{x}_t-\boldsymbol{x}_0\Vert^2/2\sigma_t^2}\end{equation} i.e., $\phi_t(r) = \frac{1}{(2\pi\sigma_t^2)^{d/2}} e^{-r^2/2\sigma_t^2}$, where $\sigma_t$ is a monotonically increasing function of $t$, satisfying $\sigma_0=0$ and $\sigma_T$ being large enough. $\sigma_0=0$ is to satisfy the initial condition, and $\sigma_T$ large enough makes the prior independent of data. As for the constraint that the integral equal 1, this is a fundamental property of Gaussian distributions and is naturally satisfied.

Substituting into equation $\eqref{eq:f-g-t-r}$ yields: \begin{equation}\varphi_t(r) = \frac{\dot{\sigma}_t}{(2\pi\sigma_t^2)^{d/2}\sigma_t} e^{-r^2/2\sigma_t^2} = \frac{\dot{\sigma}_t}{\sigma_t}\phi_t(r)\end{equation} (The integral over $r$ involves the incomplete Gamma function and is complex; I used Mathematica to calculate it). With this result, and noting that after integrating out $\boldsymbol{x}_0$ the field $\boldsymbol{u}$ depends only on $(t, \boldsymbol{x}_t)$, we have: \begin{equation}\begin{aligned} \boldsymbol{u}_1(t, \boldsymbol{x}_t) =&\, \int p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)p_0(\boldsymbol{x}_0) d\boldsymbol{x}_0 = p_t(\boldsymbol{x}_t) \\ \boldsymbol{u}_{> 1}(t, \boldsymbol{x}_t) =&\, \int \frac{\dot{\sigma}_t}{\sigma_t}(\boldsymbol{x}_t - \boldsymbol{x}_0)p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)p_0(\boldsymbol{x}_0) d\boldsymbol{x}_0 \\ =&\, -\dot{\sigma}_t\sigma_t \int\nabla_{\boldsymbol{x}_t} p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)p_0(\boldsymbol{x}_0) d\boldsymbol{x}_0 \\ =&\, -\dot{\sigma}_t\sigma_t \nabla_{\boldsymbol{x}_t} p_t(\boldsymbol{x}_t) \\ \end{aligned}\end{equation} Thus, according to equation $\eqref{eq:div-eq-ode}$, we have: \begin{equation}\boldsymbol{f}_t(\boldsymbol{x}_t) = \frac{\boldsymbol{u}_{> 1}(t, \boldsymbol{x}_t)}{\boldsymbol{u}_1(t, \boldsymbol{x}_t)} = -\dot{\sigma}_t\sigma_t \nabla_{\boldsymbol{x}_t} \log p_t(\boldsymbol{x}_t) \end{equation} These results are perfectly consistent with "Generative Diffusion Models (12)". For further details on handling this, you can refer to that article.
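The Mathematica computation mentioned above can also be reproduced in SymPy. For $d=1$ with the concrete schedule $\sigma_t = t$ (so $\dot{\sigma}_t = 1$; my own choice for the check), equation $\eqref{eq:f-g-t-r}$ indeed returns $\varphi_t(r) = \frac{\dot{\sigma}_t}{\sigma_t}\phi_t(r)$:

```python
import sympy as sp

t = sp.Symbol('t', positive=True)
r = sp.Symbol('r', positive=True)
d = 1

sigma = t  # concrete schedule sigma_t = t, so sigma_dot = 1
phi = sp.exp(-r**2 / (2 * sigma**2)) / (2 * sp.pi * sigma**2)**sp.Rational(d, 2)

# Equation (f-g-t-r): varphi = -(1/r^d) * d/dt of int phi r^(d-1) dr.
F = sp.integrate(phi * r**(d - 1), r)      # antiderivative in r (erf-based)
varphi = sp.simplify(-sp.diff(F, t) / r**d)

# Claimed closed form: varphi = (sigma_dot/sigma) * phi.
claimed = sp.diff(sigma, t) / sigma * phi
print(sp.simplify(varphi - claimed))  # 0
```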

Reverse Construction

The method of giving $\phi_t(r)$ to solve for $\varphi_t(r)$ as just described is theoretically simple, but practically faces two difficulties: 1. $\phi_t(r)$ must satisfy both the initial value condition and the integral condition, which is not easy to construct; 2. The integral over $r$ does not necessarily have a simple elementary form. Given this, we can consider a reverse construction method.

We know that $\phi_t(r)$ is the probability density in Cartesian coordinates; changing to spherical coordinates requires multiplying by $C_d r^{d-1}$, where $C_d$ is some constant (dependent on $d$). Since the final result according to equation $\eqref{eq:div-eq-ode}$ is a ratio, it is not affected by constants. For simplicity, we ignore this constant. After ignoring it, this is exactly the integrand in equation $\eqref{eq:f-g-t-r}$, so the integral in $\eqref{eq:f-g-t-r}$: \begin{equation}\int \phi_t(r) r^{d-1} dr\end{equation} is exactly a cumulative probability function (more accurately, $1/C_d$ of the cumulative distribution function plus a constant, but we have ignored the irrelevant constant). While calculating cumulative probability from density is not always easy, calculating density from cumulative probability is simple (differentiation). Therefore, we can first construct the cumulative probability function and then find the corresponding $\phi_t(r), \varphi_t(r)$, thereby avoiding the difficulty of integration.

Specifically, we construct a cumulative probability function $\psi_t(r)$ that satisfies:

1. $\psi_t(0)=0, \psi_t(\infty)=1$;
2. $\psi_t(r)$ is monotonically increasing with respect to $r$;
3. $\forall r > 0, \lim\limits_{t\to 0^+} \psi_t(r)=1$.

Readers with some familiarity with activation functions will find it easy to construct functions satisfying these conditions; what is needed is essentially a smooth approximation of the "step function," such as $\tanh\left(\frac{r}{t}\right)$, $1-e^{-r/t}$, etc. With $\psi_t(r)$, based on equation $\eqref{eq:f-g-t-r}$, we have: \begin{equation}\phi_t(r) = \frac{1}{r^{d-1}}\frac{\partial}{\partial r}\psi_t(r), \quad \varphi_t(r) = -\frac{1}{r^d}\frac{\partial}{\partial t}(\psi_t(r)\color{skyblue}{+\lambda_t})\end{equation} where $\color{skyblue}{\lambda_t}$ is any function of $t$, which can generally be set to 0. Of course, all these isotropic solutions are essentially equivalent, including the "universal gravitation diffusion" derived earlier; they can all be incorporated into the above equation and derived from each other via coordinate transformations. This is because the above equation only depends on a univariate cumulative probability function $\psi_t(r)$, and cumulative probability functions of different distributions can generally be transformed into each other (they are all well-behaved monotonically increasing functions).
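As a worked example of this reverse construction (my own choice of $\psi$, with $\lambda_t = 0$), take $\psi_t(r) = 1 - e^{-r/t}$, which satisfies all three conditions; the resulting $\phi_t, \varphi_t$ can then be checked against the continuity equation symbolically:

```python
import sympy as sp

t = sp.Symbol('t', positive=True)
r = sp.Symbol('r', positive=True)
d = 3  # any spatial dimension works; take d = 3 as an example

# A cumulative probability function satisfying the three conditions:
psi = 1 - sp.exp(-r / t)

phi = sp.diff(psi, r) / r**(d - 1)   # density phi_t(r) (up to C_d)
varphi = -sp.diff(psi, t) / r**d     # field coefficient varphi_t(r)

# Continuity equation:
# d/dt phi + (1/r^(d-1)) d/dr (varphi * r^d) = 0.
residual = sp.simplify(sp.diff(phi, t)
                       + sp.diff(varphi * r**d, r) / r**(d - 1))
print(residual)  # 0
```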

Summary

This article constructs a general framework for ODE-style diffusion. Theoretically, all ODE-style diffusion models can be incorporated into this framework. We can also derive various novel or even bizarre ODE-style diffusion models from it. For example, while the current derivations are based on the isotropy assumption, one could replace the isotropic $\varphi(R)$ with a more general $\varphi(t; \boldsymbol{x}_t, \boldsymbol{x}_0)$. This could be solved using the "method of characteristics for first-order partial differential equations," yielding a new family of models. Overall, this is a veritable "production workshop" for ODE-style diffusion models.

Some readers might ask: I just want a usable generative diffusion model; what's the value in creating so many flashy variations? In fact, just like "Introduction to f-GAN: Production Workshop for GAN Models" and "Designing GANs: Another GAN Production Workshop", we hope to discover and master the laws of constructing generative models to further understand the keys to generation. This is a quest for optimization and discovery of more effective models—an endless pursuit of perfection.

Experimental results in the original "Universal Gravitation Diffusion" paper have shown that, as an ODE-style diffusion model, it performs better than Gaussian diffusion. This suggests that even under the isotropy assumption, these mathematically equivalent diffusion models still exhibit performance differences in practice. Therefore, how to better combine experimental details to answer "what kind of design makes a better diffusion model" will be a very meaningful research question in the future.