By 苏剑林 | October 18, 2022
For many readers, the generative diffusion model may be the first model they have encountered that applies so many mathematical tools to deep learning. In this series of articles, we have demonstrated the profound connections between diffusion models and mathematical analysis, probability and statistics, ordinary differential equations (ODEs), stochastic differential equations (SDEs), and even partial differential equations (PDEs). It can be said that even students doing pure theoretical research in mathematical physics equations are likely to find a use for their skills within diffusion models.
In this article, we introduce another diffusion model that also has deep connections with mathematical physics—an ODE-based diffusion model inspired by the "Law of Universal Gravitation," from the paper "Poisson Flow Generative Models" (abbreviated as PFGM). It provides an entirely new perspective on constructing ODE-based diffusion models.
Universal Gravitation
In middle school, we learned the Law of Universal Gravitation, which is roughly described as:
The gravitational force between two point masses is proportional to the product of their masses and inversely proportional to the square of the distance between them.
Here, we ignore the masses and physical constants, focusing mainly on the direction of the force and its dependence on distance. Assuming the gravitational source is located at $\boldsymbol{y}$, the gravitational force experienced by an object at $\boldsymbol{x}$ can be written as:
\begin{equation}\boldsymbol{F}(\boldsymbol{x}) = -\frac{1}{4\pi}\frac{\boldsymbol{x} - \boldsymbol{y}}{\Vert \boldsymbol{x} - \boldsymbol{y}\Vert^3}\label{eq:grad-3}\end{equation}
We can ignore the factor $\frac{1}{4\pi}$ for now; it does not affect the subsequent analysis. To be precise, the above equation describes the gravitational field in three-dimensional space. For a $d$-dimensional space, the gravitational field takes the form:
\begin{equation}\boldsymbol{F}(\boldsymbol{x}) = -\frac{1}{S_d(1)}\frac{\boldsymbol{x} - \boldsymbol{y}}{\Vert \boldsymbol{x} - \boldsymbol{y}\Vert^d}\label{eq:grad-d}\end{equation}
where $S_d(1)$ is the surface area of a $d$-dimensional unit hypersphere. This formula is actually the gradient of the Green's function for the $d$-dimensional Poisson equation, which is where the word "Poisson" in the paper's title comes from.
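To make the formula concrete, here is a minimal numpy sketch of this single-source field (function names are my own); for $d=3$, `unit_sphere_area(3)` recovers the $\frac{1}{4\pi}$ factor of Equation \eqref{eq:grad-3}.

```python
from math import gamma, pi
import numpy as np

def unit_sphere_area(d):
    """S_d(1): surface area of the unit sphere in d-dimensional space."""
    return 2 * pi ** (d / 2) / gamma(d / 2)

def gravity_field(x, y):
    """Field at x due to a single unit source at y, per Eq. \eqref{eq:grad-d}."""
    d = x.shape[-1]
    diff = x - y
    r = np.linalg.norm(diff)
    return -diff / (unit_sphere_area(d) * r ** d)
```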
Following Field Lines
If there are multiple gravitational sources, the total gravitational field is simply the sum of the individual gravitational fields, due to the linear superposition property of gravitational fields. Below is a conceptual illustration of a vector field with four gravitational sources (sources marked with black dots, field lines in color):
[Schematic of Gravitational Field]
From the gravitational field diagram above, we can observe an important characteristic:
With only a few exceptions, field lines originate from infinity and terminate at a gravitational source point.
At this point, an intuitive and "whimsical" idea is:
If each gravitational source represents a real sample point to be generated, wouldn't any arbitrary point from afar, as long as it moves along the field lines, eventually evolve into a real sample point?
This is the core brilliant idea of the Poisson Flow Generative Models paper!
Equivalent Center of Mass
Of course, a brilliant idea is just an idea; to turn it into a usable model, many details need to be supplemented. For instance, we just said "any arbitrary point from afar." This represents the initial distribution of the diffusion model, so the questions arise: How far is "afar"? How should "arbitrary points" be sampled? If the sampling method is too complex, it loses its value.
Fortunately, the gravitational field has a very important equivalence property:
At infinite distance, a multi-source gravitational field is equivalent to the field of a single point mass carrying the total mass, located at the center of mass.
In other words, when the distance is sufficiently far, we only need to treat it as a single-source gravitational field at the center of mass. The diagram below shows a multi-source gravitational field and its corresponding center-of-mass gravitational field. As seen, when the distance increases (the position of the orange circle), the two gravitational fields become almost identical.
[Multi-source Field vs Center-of-Mass Field]
What are the characteristics of a single point-mass gravitational field? Isotropy! This means that at a sufficiently large radius, we can assume the field lines pass through the surface of a sphere centered at the point mass uniformly. Therefore, we can simply perform uniform sampling on a sphere with a sufficiently large radius, which solves the problem of sampling the initial distribution. As for how large "sufficiently large" is, we will discuss that later.
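Uniform sampling on a hypersphere is itself easy, thanks to the isotropy of the standard Gaussian. A minimal sketch (names mine):

```python
import numpy as np

def sample_on_sphere(n, d, radius):
    """Draw n points uniformly on a radius-`radius` sphere in d dimensions:
    normalize isotropic Gaussian samples, then scale."""
    z = np.random.randn(n, d)
    return radius * z / np.linalg.norm(z, axis=1, keepdims=True)
```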
Mode Collapse
So, is the generative model constructed? Not yet. While the isotropy of the gravitational field makes the initial distribution easy to sample, it also causes a cancellation effect between gravitational sources, leading to "Mode Collapse."
Specifically, let's look at the gravitational field produced by a uniform distribution on a spherical shell:
[Gravitational Field of Isotropic Sources]
Notice the characteristic? Outside the shell, the field is isotropic as expected, but inside the shell it is "empty"! That is, the gravitational fields inside the shell cancel each other out, leaving an effective vacuum zone. This cancellation implies that whenever gravitational sources are uniformly distributed on some sphere, their forces cancel, as if those sources did not exist. Since our generative model works by moving points from afar along field lines to reach a specific gravitational source, sources that cancel out can never be reached, meaning certain real samples can never be generated. The result is a loss of diversity, which is the "Mode Collapse" phenomenon.
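The cancellation is easy to verify numerically. Below is a small Monte Carlo check (a sketch with names of my own choosing): using Equation \eqref{eq:grad-d} with $d=2$ and constants dropped, many sources spread uniformly on a circle produce a field that is approximately zero at any interior point.

```python
import numpy as np

def shell_field_2d(query, n_sources=100000, radius=1.0):
    """2D field (Eq. \eqref{eq:grad-d} with d=2, constants dropped) of
    n_sources points uniformly distributed on a circle, at `query`."""
    theta = np.random.uniform(0, 2 * np.pi, n_sources)
    sources = radius * np.stack([np.cos(theta), np.sin(theta)], axis=1)
    diff = query[None, :] - sources
    r2 = (diff ** 2).sum(axis=1, keepdims=True)
    return -(diff / r2).mean(axis=0)

# Inside the shell, both components are ~0 up to Monte Carlo noise:
print(shell_field_2d(np.array([0.3, -0.2])))
```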
Adding a Dimension
It seems that mode collapse is unavoidable: when constructing generative models, we typically assume that real samples follow a continuous distribution. Thus, for any sphere, even if the real samples on it are not uniformly distributed, we can always extract a uniformly distributed "component," whose gravity cancels out, effectively erasing those data points and causing mode collapse.
Is this the end of the road? No! This is where the second "brilliant idea" of PFGM comes in: Add one dimension!
We just argued that mode collapse is unavoidable because the continuous-distribution assumption makes isotropic components unavoidable. To avoid mode collapse, we must prevent the distribution from being isotropic. But the distribution of real samples is the target distribution and cannot be changed. What we can change is the space it lives in: add a dimension. Working in a $d+1$ dimensional space, the original $d$-dimensional distribution becomes a hyperplane in that space, and a hyperplane cannot be isotropic. Take a low-dimensional example: the isotropic object in 2D is a circle, while in 3D it is a sphere; a circle that is isotropic within 2D is no longer isotropic when viewed from 3D space.
Therefore, assume the real samples to be generated were originally $\boldsymbol{x}\in\mathbb{R}^d$. We introduce a new dimension $t$, making the data points $(\boldsymbol{x},t)\in\mathbb{R}^{d+1}$. The original distribution of real samples was $\boldsymbol{x}\sim \tilde{p}(\boldsymbol{x})$; we now change it to $(\boldsymbol{x},t)\sim \delta(t)\tilde{p}(\boldsymbol{x})$, where $\delta(t)$ is the Dirac distribution. This effectively places the real samples on the $t=0$ plane in $d+1$ dimensional space. With this treatment, in $d+1$ dimensional space, the $t$ value of real sample points is always 0, preventing isotropy (analogous to the "circle in 3D" example).
Sudden Enlightenment
At first glance, adding a dimension looks like a minor mathematical trick, but upon reflection, it is exquisitely clever. Many details difficult to handle in the original $d$-dimensional space become clear in $d+1$ dimensional space.
According to Equation \eqref{eq:grad-d} and the linear superposition of gravity, we can write the gravitational field in $d+1$ dimensional space as:
\begin{equation}\begin{aligned}
\boldsymbol{F}(\boldsymbol{x}, t) =&\, -\frac{1}{S_{d+1}(1)}\iint\frac{(\boldsymbol{x} - \boldsymbol{x}_0, t - t_0)}{(\Vert\boldsymbol{x} - \boldsymbol{x}_0\Vert^2 + (t - t_0)^2)^{(d+1)/2}}\delta(t_0)\tilde{p}(\boldsymbol{x}_0) d\boldsymbol{x}_0dt_0 \\
=&\, -\frac{1}{S_{d+1}(1)}\int\frac{(\boldsymbol{x} - \boldsymbol{x}_0, t)}{(\Vert\boldsymbol{x} - \boldsymbol{x}_0\Vert^2 + t^2)^{(d+1)/2}}\tilde{p}(\boldsymbol{x}_0) d\boldsymbol{x}_0 \\
\triangleq&\, (\boldsymbol{F}_{\boldsymbol{x}}, \boldsymbol{F}_t)
\end{aligned}\label{eq:field}\end{equation}
where $\boldsymbol{F}_{\boldsymbol{x}}$ represents the first $d$ components of $\boldsymbol{F}(\boldsymbol{x}, t)$, and $\boldsymbol{F}_t$ is its $(d+1)$-th component. We will discuss how to learn $\boldsymbol{F}(\boldsymbol{x}, t)$ in the next section. For now, assuming $\boldsymbol{F}(\boldsymbol{x}, t)$ is known, we need to move along the field lines, meaning the trajectory must always be in the same direction as $\boldsymbol{F}(\boldsymbol{x}, t)$:
\begin{equation}(d\boldsymbol{x}, dt) = (\boldsymbol{F}_{\boldsymbol{x}}, \boldsymbol{F}_t) d\tau\quad\Rightarrow\quad \frac{d\boldsymbol{x}}{dt} = \frac{\boldsymbol{F}_{\boldsymbol{x}}}{\boldsymbol{F}_t}\label{eq:ode}\end{equation}
This is the ordinary differential equation (ODE) required for the generation process. In the previous $d$-dimensional scheme, besides mode collapse, termination was also a difficult detail to handle. Intuitively, one would move along field lines until "hitting" a real sample, but it was hard to judge when that happened. In the $d+1$ dimensional scheme, we know all real samples lie on the $t=0$ plane, so $t=0$ serves as a natural termination signal.
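As a concrete illustration, here is a minimal Euler-method sketch of this generation procedure; `field` stands in for the learned vector field, and the uniform step size in $t$ is a simplifying assumption (in practice an adaptive ODE solver can be substituted).

```python
import numpy as np

def generate(x_T, T, field, n_steps=100):
    """Integrate dx/dt = F_x / F_t from t=T down to t=0 with Euler steps.
    `field(x, t)` must return the pair (F_x, F_t)."""
    x = x_T.copy()
    ts = np.linspace(T, 0.0, n_steps + 1)
    for i in range(n_steps):
        dt = ts[i + 1] - ts[i]            # negative: t decreases toward 0
        F_x, F_t = field(x, ts[i])
        x = x + dt * F_x / F_t
    return x                              # approximate sample on the t=0 plane
```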
As for the initial distribution, following our previous discussion, it should be a "uniform distribution on a $d+1$ dimensional sphere with a sufficiently large radius." But since we use $t=0$ as the termination signal, we might as well fix a sufficiently large $t=T$ (roughly in the range of $40 \sim 100$) and sample on the $t=T$ plane. Thus, the generation process becomes the movement defined by the ODE in \eqref{eq:ode} from $t=T$ to $t=0$, making both the beginning and end of the generation process quite clear.
Of course, sampling on a fixed $t=T$ plane won't be uniform. In fact:
\begin{equation}p_{prior}(\boldsymbol{x}) \propto \frac{1}{(\Vert\boldsymbol{x}\Vert^2 + T^2)^{(d+1)/2}}\end{equation}
The derivation is in the box below. As we can see, the probability density depends only on the length $\Vert\boldsymbol{x}\Vert$. Thus, the scheme for sampling from this distribution is to first sample the length according to a specific distribution, then sample the direction uniformly, and combine them. For the length sampling, let $r=\Vert\boldsymbol{x}\Vert$; by transforming to hyperspherical coordinates, we get $p_{prior}(r)\propto r^{d-1}(r^2 + T^2)^{-(d+1)/2}$, then use the inverse cumulative distribution function method for sampling (refer to "Spherical VAEs").
Derivation of the initial distribution: Field lines pass uniformly through any $d+1$ dimensional hypersphere centered at the origin, so the density at $(\boldsymbol{x}, T)$ is inversely proportional to the surface area $S_{d+1}(\boldsymbol{x}, T)$ of the sphere passing through $(\boldsymbol{x}, T)$; since a sphere of radius $r$ has surface area $\propto r^d$, this density is $\propto \frac{1}{(\Vert\boldsymbol{x}\Vert^2 + T^2)^{d/2}}$. But now, instead of sampling on the sphere, we sample on the $t=T$ plane, so we need to project the sphere onto the plane.
[Projection from Sphere to t=T Plane]
As shown in the figure above, when points $B$ and $D$ are sufficiently close, $\Delta OAB \sim \Delta BDC$, so:
\begin{equation}\frac{|BC|}{|BD|} = \frac{|OB|}{|OA|} = \frac{\sqrt{\Vert\boldsymbol{x}\Vert^2 + T^2}}{T}\end{equation}
This means that a unit arc length on the sphere, when projected onto the plane, increases in length by a factor of $\frac{\sqrt{\Vert\boldsymbol{x}\Vert^2 + T^2}}{T}$. Since only one dimension changes, the area element on the original sphere, when projected, also increases by a factor of $\frac{\sqrt{\Vert\boldsymbol{x}\Vert^2 + T^2}}{T}$. Therefore, according to the inverse relationship between probability and area, we get:
\begin{equation}p_{prior}(\boldsymbol{x}) \propto \frac{1}{S_{d+1}(\boldsymbol{x}, T)}\times \frac{T}{\sqrt{\Vert\boldsymbol{x}\Vert^2 + T^2}}\propto \frac{1}{(\Vert\boldsymbol{x}\Vert^2 + T^2)^{(d+1)/2}}\end{equation}
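Incidentally, the projection picture in the box also suggests a sampler that avoids the explicit inverse CDF: a point drawn uniformly on the unit sphere in $d+1$ dimensions, projected radially onto the $t=T$ plane, has exactly the density $p_{prior}$ (one can check that its radial density matches $r^{d-1}(r^2+T^2)^{-(d+1)/2}$). The sketch below uses this alternative trick, my own substitution for the inverse-CDF route mentioned above; names are mine.

```python
import numpy as np

def sample_prior(n, d, T):
    """Sample x with density ∝ (||x||^2 + T^2)^{-(d+1)/2} on the t=T plane,
    by centrally projecting uniform points on the unit sphere in R^{d+1}."""
    v = np.random.randn(n, d + 1)
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # uniform on the unit sphere
    x, vt = v[:, :d], np.abs(v[:, d:])              # fold onto the t > 0 hemisphere
    return T * x / vt                               # ray through origin hits t = T
```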
Training the Field
Now we have the initial distribution and the ODE; all that remains is to train the vector field function $\boldsymbol{F}(\boldsymbol{x}, t)$. As seen from Equation \eqref{eq:ode}, the ODE depends only on the ratio $\boldsymbol{F}_{\boldsymbol{x}}/\boldsymbol{F}_t$, so rescaling the entire vector field does not change the trajectories. According to Equation \eqref{eq:field}, the vector field can be written as:
\begin{equation}\boldsymbol{F}(\boldsymbol{x}, t) = \mathbb{E}_{\boldsymbol{x}_0\sim \tilde{p}(\boldsymbol{x}_0)}\left[-\frac{(\boldsymbol{x} - \boldsymbol{x}_0, t)}{(\Vert\boldsymbol{x} - \boldsymbol{x}_0\Vert^2 + t^2)^{(d+1)/2}}\right]\end{equation}
As we have used multiple times in previous articles (such as the SDE and optimal variance estimation posts), we have the identity:
\begin{equation}\mathbb{E}_{\boldsymbol{x}}[\boldsymbol{x}] = \mathop{\text{argmin}}_{\boldsymbol{\mu}}\mathbb{E}_{\boldsymbol{x}}\left[\Vert \boldsymbol{x} - \boldsymbol{\mu}\Vert^2\right]\end{equation}
We can introduce a function $\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}, t)$ to learn $\boldsymbol{F}(\boldsymbol{x}, t)$, with the training objective:
\begin{equation}\mathbb{E}_{\boldsymbol{x}_0\sim \tilde{p}(\boldsymbol{x}_0)}\left[\left\Vert\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}, t) + \frac{(\boldsymbol{x} - \boldsymbol{x}_0, t)}{(\Vert\boldsymbol{x} - \boldsymbol{x}_0\Vert^2 + t^2)^{(d+1)/2}}\right\Vert^2\right]\label{eq:loss}\end{equation}
However, the samples for $\boldsymbol{x},t$ in the above objective still need to be defined. This is one of the main features of PFGM: it directly defines the reverse process (generation) without needing to define a forward process; however, this sampling step effectively serves as the forward process. For this, the original paper considers perturbing each real sample to construct $\boldsymbol{x},t$ samples:
\begin{equation}\boldsymbol{x} = \boldsymbol{x}_0 + \Vert \boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert (1+\tau)^m \boldsymbol{u},\quad t = |\varepsilon_t| (1+\tau)^m\end{equation}
where $(\boldsymbol{\varepsilon}_{\boldsymbol{x}},\varepsilon_t)\sim\mathcal{N}(\boldsymbol{0}, \sigma^2 \boldsymbol{I}_{(d+1)\times(d+1)})$, $m\sim U[0,M]$, $\boldsymbol{u}$ is a unit vector uniformly distributed on a $d$-dimensional sphere, and $\tau, \sigma, M$ are constants. (Further discussion on this can be found in later posts).
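In code, the perturbation above might look like the following sketch; the specific values of $\sigma$, $\tau$, $M$ here are placeholders, not the paper's settings.

```python
import numpy as np

def make_training_pairs(x0, sigma=0.01, tau=0.03, M=200):
    """Perturb real samples x0 (shape [n, d]) into (x, t) training inputs
    following the perturbation equation above."""
    n, d = x0.shape
    eps = sigma * np.random.randn(n, d + 1)          # (eps_x, eps_t)
    eps_x = np.linalg.norm(eps[:, :d], axis=1, keepdims=True)
    eps_t = np.abs(eps[:, d])
    scale = (1 + tau) ** np.random.uniform(0, M, n)  # (1 + tau)^m, m ~ U[0, M]
    u = np.random.randn(n, d)
    u /= np.linalg.norm(u, axis=1, keepdims=True)    # uniform directions on the sphere
    x = x0 + eps_x * scale[:, None] * u
    t = eps_t * scale
    return x, t
```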
Finally, the training objective in the original paper is slightly different from Equation \eqref{eq:loss} in this post, roughly corresponding to:
\begin{equation}\left\Vert\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}, t) + \text{Normalize}\left(\mathbb{E}_{\boldsymbol{x}_0\sim \tilde{p}(\boldsymbol{x}_0)}\left[\frac{(\boldsymbol{x} - \boldsymbol{x}_0, t)}{(\Vert\boldsymbol{x} - \boldsymbol{x}_0\Vert^2 + t^2)^{(d+1)/2}}\right]\right)\right\Vert^2\end{equation}
In actual training, only a finite number of $\boldsymbol{x}_0$ can be sampled to estimate the expectation inside the parentheses, so this objective is actually a biased estimate. Of course, biased is not necessarily worse than unbiased. Why exactly the original paper uses a biased estimate is not clear to me; my guess is that normalizing the vector makes the training process more stable. However, because it is a normalized biased estimate, it may require a larger batch size to be accurate, which increases experimental cost.
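For concreteness, here is a sketch of how the normalized regression target could be estimated from a finite batch of real samples (my own rendering; the paper's exact normalization may differ in scale, which is harmless since the ODE only uses the direction of the field):

```python
import numpy as np

def normalized_target(x, t, x0_batch):
    """Batch estimate of Normalize(E[(x - x0, t) / (||x - x0||^2 + t^2)^{(d+1)/2}]).
    The network s_theta(x, t) is then regressed onto the negative of this."""
    d = x0_batch.shape[1]
    diff = x[None, :] - x0_batch                                   # [n0, d]
    dist2 = (diff ** 2).sum(axis=1) + t ** 2                       # [n0]
    comp = np.concatenate([diff, np.full((len(x0_batch), 1), t)], axis=1)
    field = (comp / dist2[:, None] ** ((d + 1) / 2)).mean(axis=0)  # [d + 1]
    return field / np.linalg.norm(field)
```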
Experimental Results
PFGM is a thoroughly new framework: it no longer relies on Gaussian assumptions, and the resulting model has genuinely new substance. However, we cannot pursue novelty for novelty's sake; if the new framework does not yield more convincing results, being new is meaningless.
Naturally, the experimental results in the paper confirm the value of PFGM, such as obtaining better evaluation metrics, faster generation speeds, and better robustness to hyperparameters (including model architecture). I won't display them all here; you can read the original paper. I saw that the original paper was accepted to NeurIPS 2022, and I have to say, it truly deserves to be a top-tier conference paper!
Official Github: https://github.com/Newbeeer/Poisson_flow
Article Summary
This article introduced an ODE-based diffusion model inspired by the "Law of Universal Gravitation." It breaks through the reliance of many previous diffusion models on Gaussian assumptions and is a brand-new framework based on field theory for constructing ODE-based diffusion models. The entire model is highly instructive and well worth careful study.