Generative Diffusion Model Chat (30): From Instantaneous Velocity to Average Velocity

By 苏剑林 | May 26, 2025

As is well known, slow generation speed has always been a pain point for diffusion models. To solve this problem, researchers have "shown their prowess" by proposing a variety of solutions. However, for a long time, no single work has managed to stand out as a standard. What kind of work would reach such a standard? In my view, it must satisfy at least a few conditions:

1. The mathematical principles are clear, revealing the essence of fast generation;
2. It can be trained from scratch with a single objective, without requiring additional means like GANs or distillation;
3. Single-step generation is close to SOTA, and quality can be improved by increasing the number of steps.

According to my reading experience, almost no work has met all three criteria simultaneously. However, a few days ago, an arXiv paper titled "Mean Flows for One-step Generative Modeling" (referred to as "MeanFlow") appeared, which seems very promising. Next, we will use this as an opportunity to discuss related ideas and progress.

Existing Approaches

There have been many works on accelerating diffusion model generation, some of which have been briefly introduced in this blog before. In general, acceleration strategies can be categorized into three types.

First, converting the diffusion model into an SDE/ODE and developing more efficient solvers. The representative work is DPM-Solver and its series of subsequent improvements. However, this approach generally can only reduce the NFE (Number of Function Evaluations) to around 10; going lower significantly degrades generation quality. This is because a solver's accuracy typically improves only as some power of the step size, so when the NFE is very small, the step size cannot be made small enough for the solver to converge adequately.

Second, converting a pre-trained diffusion model into a few-step generator through distillation. There are many such works and schemes; we previously introduced one called SiD. Distillation is a conventional and fairly general idea, but it shares a common drawback: it incurs extra training cost and is not a "train-from-scratch" solution. Some works, in order to distill down to a one-step generator, even add adversarial training and several other optimization strategies, making the whole scheme overly complex.

Third, approaches based on Consistency Models (CM), including the CM we introduced in "Generative Diffusion Model Chat (28): Understanding Consistency Models Step by Step", its continuous version sCM, as well as CTM, etc. CM is a distinctive approach that allows models with very small NFE to be either trained from scratch or obtained by distillation. However, the objective of CM depends on EMA or stop_gradient operations, which means it is coupled with the optimizer dynamics, and this often leaves a vague, hard-to-pin-down impression.

Instantaneous Velocity

So far, diffusion models with the smallest NFE are essentially ODEs, as deterministic models are often easier to analyze and solve. This article also focuses only on ODE-type diffusion, using the framework of ReFlow introduced in "Generative Diffusion Model Chat (17): General Steps for Building ODEs (Part 2)", which is essentially consistent with Flow Matching but more intuitive.

ODE-type diffusion aims to learn an ODE \begin{equation}\frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\label{eq:ode}\end{equation} to construct a transformation $\boldsymbol{x}_1 \to \boldsymbol{x}_0$. Specifically, let $\boldsymbol{x}_1 \sim p_1(\boldsymbol{x}_1)$ be some easily sampled random noise, and $\boldsymbol{x}_0 \sim p_0(\boldsymbol{x}_0)$ be real samples from the target distribution. We hope to achieve the transformation from random noise to target samples via the aforementioned ODE. By sampling a random $\boldsymbol{x}_1 \sim p_1(\boldsymbol{x}_1)$ as an initial value, the $\boldsymbol{x}_0$ obtained by solving the ODE will be a sample of $p_0(\boldsymbol{x}_0)$.

If we view $t$ as time and $\boldsymbol{x}_t$ as displacement, then $d\boldsymbol{x}_t/dt$ is the instantaneous velocity. Thus, ODE-type diffusion is the modeling of instantaneous velocity. How do we train $\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)$? ReFlow proposes a very intuitive method: first construct an arbitrary interpolation between $\boldsymbol{x}_0$ and $\boldsymbol{x}_1$, such as the simplest linear interpolation $\boldsymbol{x}_t = (1-t)\boldsymbol{x}_0 + t\boldsymbol{x}_1$. Then, differentiating with respect to $t$ gives \begin{equation}\frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{x}_1 - \boldsymbol{x}_0\end{equation} This is an extremely simple ODE, but it does not meet our requirements because $\boldsymbol{x}_0$ is our target and should not appear in the ODE. To address this, ReFlow proposes a very intuitive idea — use $\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)$ to approximate $\boldsymbol{x}_1 - \boldsymbol{x}_0$: \begin{equation}\mathbb{E}_{t,\boldsymbol{x}_0,\boldsymbol{x}_1}\left[\Vert\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\Vert^2\right]\label{eq:loss-reflow}\end{equation} This is the objective function of ReFlow. It is worth noting that: 1) ReFlow theoretically allows any interpolation method for $\boldsymbol{x}_0$ and $\boldsymbol{x}_1$; 2) Although intuitive, ReFlow is theoretically rigorous, and it can be proven that its optimal solution is indeed our desired ODE. For details, please refer to "Generative Diffusion Model Chat (17): General Steps for Building ODEs (Part 2)" and the original paper.
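
As a concrete illustration, here is a minimal JAX sketch of this objective; the velocity network v_model(params, xt, t) is a hypothetical placeholder, and the time sampling is simply uniform:

import jax, jax.numpy as jnp

def reflow_loss(params, x0, x1, key):
    # sample t ~ U[0, 1] per example, shaped to broadcast over the data dimensions
    t = jax.random.uniform(key, (x0.shape[0],) + (1,) * (x0.ndim - 1))
    xt = (1 - t) * x0 + t * x1                 # linear interpolation between data and noise
    pred = v_model(params, xt, t)              # hypothetical instantaneous-velocity network
    return jnp.mean((pred - (x1 - x0)) ** 2)   # regress onto x1 - x0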

Average Velocity

However, an ODE is merely a pure mathematical form; actually solving it requires discretization, such as the simplest Euler method: \begin{equation}\boldsymbol{x}_{t - \Delta t} = \boldsymbol{x}_t - \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) \Delta t\end{equation} The NFE for going from 1 to 0 is $1/\Delta t$, so wanting a small NFE is equivalent to wanting a large $\Delta t$. However, the theoretical basis of ReFlow is the exact ODE, meaning target samples are generated only when the ODE is solved exactly. This implies $\Delta t$ should be as small as possible, which contradicts our expectation. Although ReFlow argues that using straight-line interpolation makes the ODE trajectories straighter and thus allows a larger $\Delta t$, the actual trajectories are still curved, and it is difficult to push $\Delta t$ toward 1, so ReFlow struggles with one-step generation.
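
For reference, a minimal sketch of this Euler sampler with a trained instantaneous-velocity model (v_model and params are the same hypothetical placeholders as above; the number of loop iterations is exactly the NFE):

def euler_sample(params, x1, nfe):
    # integrate dx/dt = v(x, t) from t = 1 down to t = 0 with fixed step 1/nfe
    dt = 1.0 / nfe
    x = x1
    for i in range(nfe):
        t = 1.0 - i * dt
        x = x - v_model(params, x, t) * dt
    return x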

Fundamentally, an ODE is a concept where $\Delta t \to 0$. We are forcing it to be used for $\Delta t \to 1$ and expecting good results, which is essentially "making things difficult for the model." Therefore, changing the modeling target, rather than continuing to "pressurize" the model, is the essential path to faster generation. To this end, we consider integrating both sides of Eq. $\eqref{eq:ode}$: \begin{equation}\boldsymbol{x}_t - \boldsymbol{x}_r = \int_r^t \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_{\tau},\tau) d\tau = (t-r)\times \frac{1}{t-r}\int_r^t \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_{\tau},\tau) d\tau\end{equation} If we can model \begin{equation} \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) \triangleq \frac{1}{t-r}\int_r^t \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_{\tau},\tau) d\tau\end{equation} then we have $\boldsymbol{x}_0 = \boldsymbol{x}_1 - \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_1, 0, 1)$. Theoretically, this allows us to precisely achieve one-step generation without resorting to approximate relationships. If $\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)$ is the instantaneous velocity at time $t$, then clearly $\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)$ is the average velocity over the time interval $[r,t]$. In other words, to accelerate or even achieve one-step generation, our modeling target should be the average velocity rather than the instantaneous velocity of the ODE.
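
In code, one-step generation with an average-velocity model is a single network call, and the same update gives a few-step sampler; here u_model(params, xt, r, t) is a hypothetical network standing in for $\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)$:

def meanflow_sample(params, x1, steps=1):
    # jump along x_r = x_t - (t - r) * u(x_t, r, t); steps=1 gives x0 = x1 - u(x1, 0, 1)
    x = x1
    for i in range(steps):
        t = 1.0 - i / steps
        r = 1.0 - (i + 1) / steps
        x = x - (t - r) * u_model(params, x, r, t)
    return x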

Identity Transformation

Of course, the shift from instantaneous to average velocity is not hard to conceive; the truly difficult part is how to construct a loss function for it. ReFlow only tells us how to build a loss function for instantaneous velocity; we know nothing about training the average velocity.

The next natural idea is to "turn the unknown into the known," i.e., using the average velocity $\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)$ as a starting point to construct the instantaneous velocity $\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)$, and then substituting it into the ReFlow objective function. This requires us to derive the identity transformation between the two. From the definition of $\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)$, we get \begin{equation} \int_r^t \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_{\tau},\tau) d\tau = (t-r)\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) \end{equation} Differentiating both sides with respect to $t$, we obtain \begin{equation}\begin{aligned} \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) =&\, \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\frac{d}{dt}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) \\ =&\, \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\left[\frac{d\boldsymbol{x}_t}{dt}\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right] \end{aligned}\label{eq:id1}\end{equation} This is the first identity relationship between $\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)$ and $\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)$. Where there is a first, there is a second; the second identity is derived from the definition of average velocity: \begin{equation}\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) = \lim_{r\to t}\frac{1}{t-r}\int_r^t \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_{\tau},\tau) d\tau = \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t)\label{eq:id2}\end{equation} Simply put, the average velocity over an infinitesimally small interval equals the instantaneous velocity.

The First Objective

Based on $d\boldsymbol{x}_t/dt = \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)$ and identity $\eqref{eq:id2}$, we can replace $d\boldsymbol{x}_t/dt$ in identity $\eqref{eq:id1}$ with $\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)$ or $\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t)$. The former is an implicit relationship, which we will discuss later; let's look at the latter first. We then have:

\begin{equation}\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) = \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\left[\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t)\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right]\end{equation}

Substituting this into ReFlow, we get the first objective function that can be used to train $\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)$:

\begin{equation}\mathbb{E}_{r,t,\boldsymbol{x}_0,\boldsymbol{x}_1}\left[\left\Vert\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\left[\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t)\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right] - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\right\Vert^2\right]\label{eq:loss-1}\end{equation}

This is a very ideal result, satisfying all our expectations for a generative model's objective function:

1. A single explicit minimization goal;
2. No operations like EMA or stop_gradient;
3. Theoretically guaranteed (via ReFlow).

These properties mean that no matter what optimization algorithm we use, as long as we can find the minimum of the above equation, it is our desired average velocity model — a generative model that theoretically enables one-step generation. In other words, it possesses the simplicity of training and theoretical guarantees of diffusion models, while achieving one-step generation like a GAN, without needing to pray to the gods that the model doesn't "lose its way" and collapse during training.

JVP Operation

However, for some readers, implementing objective $\eqref{eq:loss-1}$ might be slightly difficult because it involves the "Jacobian-Vector Product (JVP)", an operation many users have not encountered. Specifically, we can write the part inside the square brackets as:

\begin{equation}\underbrace{\left[\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t),0,1\right]}_{\text{Vector}}\cdot\underbrace{\left[\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t), \frac{\partial}{\partial r}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t), \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right]}_{\text{Jacobian Matrix}}\end{equation}

This is the multiplication of the Jacobian matrix of $\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)$ with the given vector $[\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t),0,1]$. The result is a vector of the same size as $\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)$. This operation is called JVP, and there are ready-made implementations in Jax and torch. For example, Jax reference code is:

import jax
u = lambda xt, r, t: diffusion_model(weights, [xt, r, t])  # wrap the model as a function of (xt, r, t)
urt, durt = jax.jvp(u, (xt, r, t), (u(xt, t, t), r * 0, t * 0 + 1))  # tangent vector is (u(xt, t, t), 0, 1)

Here urt is $\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)$ and durt is the corresponding JVP result; usage in torch is similar. Once the JVP operation is understood, implementing objective $\eqref{eq:loss-1}$ presents virtually no difficulty.
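
Putting the pieces together, a minimal JAX sketch of the first objective $\eqref{eq:loss-1}$ might look as follows; u_model(params, xt, r, t) is the same hypothetical average-velocity network as in the earlier sampling sketch, and r, t are assumed to be arrays broadcastable against the data:

import jax, jax.numpy as jnp

def loss_1(params, x0, x1, r, t):
    # r, t: arrays of shape (batch, 1, ..., 1) with r <= t
    xt = (1 - t) * x0 + t * x1
    u = lambda xt, r, t: u_model(params, xt, r, t)   # hypothetical average-velocity network
    utt = u(xt, t, t)                                # instantaneous velocity u(x_t, t, t)
    # JVP along the tangent (u(x_t, t, t), 0, 1) gives the bracketed term of the objective
    urt, durt = jax.jvp(u, (xt, r, t), (utt, jnp.zeros_like(r), jnp.ones_like(t)))
    pred = urt + (t - r) * durt                      # reconstruction of dx_t/dt from u
    return jnp.mean((pred - (x1 - x0)) ** 2)

Optimizing this with a gradient optimizer differentiates through the JVP, which is the second-order cost discussed in the next section.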

The Second Objective

If there is one disadvantage to objective function $\eqref{eq:loss-1}$, in my opinion, it is the relatively large computational cost. This is because it requires two different forward passes $\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)$ and $\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t)$, then a gradient calculation via JVP, and another gradient calculation when using gradient descent for optimization. So it essentially requires second-order gradients, similar to WGAN-GP.

To reduce computation, we could consider adding a stop_gradient operation ($\sg{\cdot}$) to the JVP part:

\begin{equation}\mathbb{E}_{r,t,\boldsymbol{x}_0,\boldsymbol{x}_1}\left[\left\Vert\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\sg{\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t)\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)} - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\right\Vert^2\right]\label{eq:loss-2}\end{equation}

This avoids taking the gradient of the JVP portion again (though it still requires two forward passes). Experimental results show that compared to the first objective $\eqref{eq:loss-1}$, the above objective trained with a gradient optimizer can be nearly twice as fast, with no visible loss in quality.

Note that the stop_gradient here is purely for reducing computation; the actual optimization direction is still to decrease the loss value. This is different from the CM family, especially sCM, whose loss functions are only "losses" in the sense of gradient equivalence and whose values do not necessarily decrease; there, stop_gradient is often mandatory, and removing it would almost certainly lead to training collapse.
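
In JAX, the change relative to the previous sketch amounts to wrapping the JVP output in jax.lax.stop_gradient, so that the optimizer never differentiates through it (same hypothetical u_model as before):

import jax, jax.numpy as jnp

def loss_2(params, x0, x1, r, t):
    xt = (1 - t) * x0 + t * x1
    u = lambda xt, r, t: u_model(params, xt, r, t)   # hypothetical average-velocity network
    utt = u(xt, t, t)
    urt, durt = jax.jvp(u, (xt, r, t), (utt, jnp.zeros_like(r), jnp.ones_like(t)))
    # stop_gradient only skips the second-order gradient; the loss value itself is unchanged
    pred = urt + (t - r) * jax.lax.stop_gradient(durt)
    return jnp.mean((pred - (x1 - x0)) ** 2)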

The Third Objective

As mentioned earlier, the other solution for $d\boldsymbol{x}_t/dt$ in identity $\eqref{eq:id1}$ is to replace it with $\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)$, leading to:

\begin{equation}\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) = \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\left[\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right]\end{equation}

If we solve for $\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)$, the result would be:

\begin{equation}\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) = \left[\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right]\cdot\left[\boldsymbol{I} - (t-r)\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right]^{-1}\end{equation}

This involves a massive matrix inversion and is thus impractical. MeanFlow provides a compromise: since the regression target for $d\boldsymbol{x}_t/dt = \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)$ is $\boldsymbol{x}_1 - \boldsymbol{x}_0$, why not simply replace $d\boldsymbol{x}_t/dt$ with $\boldsymbol{x}_1 - \boldsymbol{x}_0$? Then the objective function becomes:

\begin{equation}\mathbb{E}_{r,t,\boldsymbol{x}_0,\boldsymbol{x}_1}\left[\left\Vert\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\left[(\boldsymbol{x}_1-\boldsymbol{x}_0)\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right] - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\right\Vert^2\right]\end{equation}

However, $\boldsymbol{x}_1 - \boldsymbol{x}_0$ now appears as both the regression target and a term in the definition of the model $\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)$, which inevitably gives a sense of "label leakage." To avoid this, MeanFlow similarly applies stop_gradient to the JVP part:

\begin{equation}\mathbb{E}_{r,t,\boldsymbol{x}_0,\boldsymbol{x}_1}\left[\left\Vert\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\sg{(\boldsymbol{x}_1-\boldsymbol{x}_0)\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)} - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\right\Vert^2\right]\label{eq:loss-3}\end{equation}

This is the final loss function used in MeanFlow, which we call the "third objective." Compared to the second objective $\eqref{eq:loss-2}$, it eliminates one forward pass $\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t)$, so training speed is even faster. However, the introduction of "label leakage" and the stop_gradient countermeasure make the training of the third objective coupled with the gradient optimizer, which, like CM, adds a bit of vague mystery.
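
A minimal JAX sketch of this third objective $\eqref{eq:loss-3}$, again with the hypothetical u_model and with MeanFlow's additional training details (how $r, t$ are sampled, any loss weighting, etc.) omitted:

import jax, jax.numpy as jnp

def loss_3(params, x0, x1, r, t):
    xt = (1 - t) * x0 + t * x1
    u = lambda xt, r, t: u_model(params, xt, r, t)   # hypothetical average-velocity network
    # tangent (x1 - x0, 0, 1): the label replaces dx_t/dt, so u(x_t, t, t) is no longer needed
    urt, durt = jax.jvp(u, (xt, r, t), (x1 - x0, jnp.zeros_like(r), jnp.ones_like(t)))
    pred = urt + (t - r) * jax.lax.stop_gradient(durt)
    return jnp.mean((pred - (x1 - x0)) ** 2)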

The paper's experimental results show that objective $\eqref{eq:loss-3}$ with $\sg{\cdot}$ can yield reasonable results. What if we remove it? I asked the author, and he indicated that without $\sg{\cdot}$, the training still converges and multi-step generation works, but the one-step generation capability is lost. This is actually easy to understand because when $r=t$, regardless of whether $\sg{\cdot}$ is present, the objective function reduces to ReFlow:

\begin{equation}\mathbb{E}_{t,\boldsymbol{x}_0,\boldsymbol{x}_1}\left[\Vert\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t) - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\Vert^2\right]\label{eq:loss-reflow-2}\end{equation}

In other words, MeanFlow always has ReFlow "backing it up," so it won't perform too poorly. However, removing $\sg{\cdot}$ exacerbates the negative impact of "label leakage," making it inferior to including it.

A Proof

Can we theoretically prove, like ReFlow, that the optimal solution of the third objective $\eqref{eq:loss-3}$ is indeed our expected average velocity model? Let's try. First, we recall two key lemmas used to prove ReFlow:

1. $\mathop{\text{argmin}}_{\boldsymbol{\mu}}\mathbb{E}[\Vert\boldsymbol{\mu} - \boldsymbol{x}\Vert^2] = \mathbb{E}[\boldsymbol{x}]$. That is, the optimal solution for minimizing the squared error between $\boldsymbol{\mu}$ and $\boldsymbol{x}$ is the mean of $\boldsymbol{x}$;
2. The ODE that transports $\boldsymbol{x}_1$ to $\boldsymbol{x}_0$ while matching the marginal distributions of the interpolation trajectory $\boldsymbol{x}_t = (1-t)\boldsymbol{x}_0 + t\boldsymbol{x}_1$ is $d\boldsymbol{x}_t/dt = \mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}[\boldsymbol{x}_1-\boldsymbol{x}_0]$.

The proof of Lemma 1 is straightforward: taking the gradient with respect to $\boldsymbol{\mu}$ gives (up to a factor of 2) $\mathbb{E}[\boldsymbol{\mu} - \boldsymbol{x}] = \boldsymbol{\mu} - \mathbb{E}[\boldsymbol{x}]$, and setting it to zero yields the result. The details of Lemma 2 can be found in "Generative Diffusion Model Chat (17): General Steps for Building ODEs (Part 2)"; note that in $\mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}[\boldsymbol{x}_1-\boldsymbol{x}_0]$ we use $\boldsymbol{x}_t=(1-t)\boldsymbol{x}_0 + t\boldsymbol{x}_1$ to eliminate $\boldsymbol{x}_1$, turning the integrand into a function of $\boldsymbol{x}_0$ and $\boldsymbol{x}_t$, and then take the expectation over $p_t(\boldsymbol{x}_0|\boldsymbol{x}_t)$, which leaves a function of $t$ and $\boldsymbol{x}_t$.
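
Written out, for the linear interpolation we have $\boldsymbol{x}_1 = (\boldsymbol{x}_t - (1-t)\boldsymbol{x}_0)/t$, hence \begin{equation}\mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}[\boldsymbol{x}_1-\boldsymbol{x}_0] = \mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}\left[\frac{\boldsymbol{x}_t-\boldsymbol{x}_0}{t}\right] = \frac{\boldsymbol{x}_t - \mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}[\boldsymbol{x}_0]}{t}\end{equation} which is indeed a function of $t$ and $\boldsymbol{x}_t$ only.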

Using Lemma 1, we can prove that the theoretical optimal solution of the ReFlow objective $\eqref{eq:loss-reflow}$ is $\boldsymbol{v}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t,t) = \mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}[\boldsymbol{x}_1-\boldsymbol{x}_0]$. Combined with Lemma 2, we get that $d\boldsymbol{x}_t/dt = \boldsymbol{v}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t,t)$ is our desired ODE. The proof for the third objective $\eqref{eq:loss-3}$ is similar. Since there is $\sg{\cdot}$ inside, taking the gradient with respect to $\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)$ and setting it to zero results in:

\begin{equation}\begin{aligned} \boldsymbol{0} =&\, \boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) + \mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}\left[(t-r)\left[(\boldsymbol{x}_1-\boldsymbol{x}_0)\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t)\right] - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\right] \\ =&\, \boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) + (t-r)\left[\mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}[\boldsymbol{x}_1-\boldsymbol{x}_0]\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t)\right] - \mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}[\boldsymbol{x}_1 - \boldsymbol{x}_0] \\ =&\, \boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) + (t-r)\left[\frac{d\boldsymbol{x}_t}{dt}\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t)\right] - \frac{d\boldsymbol{x}_t}{dt} \\ =&\, \frac{d}{dt}\left[(t - r) \boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) - (\boldsymbol{x}_t - \boldsymbol{x}_r)\right] \\ \end{aligned}\end{equation}

So under appropriate boundary conditions, we have $\boldsymbol{x}_t - \boldsymbol{x}_r = (t - r) \boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t)$, which is our expected average velocity model.

The key to this process is that the introduction of $\sg{\cdot}$ avoids taking gradients of the JVP portion, thus simplifying the gradient expression and obtaining the correct result. If $\sg{\cdot}$ were removed, the right-hand side would have to be multiplied by an extra Jacobian matrix of the JVP portion with respect to $\boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t)$, making it impossible to separate out the term $\frac{d}{dt}\left[(t - r) \boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) - (\boldsymbol{x}_t - \boldsymbol{x}_r)\right]$. The mathematical significance of introducing $\sg{\cdot}$ is to solve this problem.

Of course, I maintain that the introduction of $\sg{\cdot}$ also couples model training with the gradient optimizer, adding a layer of opacity. In this setting, the zero-gradient point is merely a stationary point rather than a (local) minimum, so its stability is also unclear. This is in fact a common issue for all models coupled with $\sg{\cdot}$.

Related Work

Interestingly, two acceleration works we introduced before, "Generative Diffusion Model Chat (21): Accelerating ODE Sampling with Mean Value Theorem" and "Generative Diffusion Model Chat (27): Using Step Size as Conditional Input", also center on the average velocity and can be regarded as part of the same lineage of ideas. Even if the authors never communicated directly, the works form a clear line of progression.

In the Mean Value Theorem post, the author had already realized the importance of the average velocity \begin{equation}\frac{1}{t-r}\int_r^t \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_{\tau},\tau) d\tau\end{equation} but the approach there was to use the integral mean value theorem for one-dimensional functions, trying to find an $s \in [r,t]$ such that $\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_s,s)$ equals the average velocity. This is essentially still the pursuit of a higher-order solver, except that it is no longer training-free and requires a small amount of distillation, which counts as a modest breakthrough on the solver side.

The Shortcut model proposed in the step size input post almost touched upon MeanFlow. Using step size as an extra input is essentially equivalent to MeanFlow's dual time parameters $r, t$. The difference is that it uses properties of average velocity as an additional regularization term for training. Using the notation of this article, the property that average velocity should satisfy is \begin{equation}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) = \frac{1}{2}\left[\boldsymbol{u}_{\boldsymbol{\theta}}\left(\boldsymbol{x}_t, s, t\right) + \boldsymbol{u}_{\boldsymbol{\theta}}\left(\boldsymbol{x}_s, r, s\right)\right]\end{equation} where $s = (r+t)/2$. Thus, Shortcut builds a regularization term with it: \begin{equation}\left\Vert\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) - \frac{1}{2}\sg{\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, s, t) + \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_s, r, s)}\right\Vert^2\end{equation} trained jointly with the ReFlow target $\eqref{eq:loss-reflow-2}$. In actual training, $\boldsymbol{x}_s = \boldsymbol{x}_t - (t-s)\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, s, t)$. In my view, the introduction of $\sg{\cdot}$ there was also mainly to save computation. The Shortcut model is actually more intuitive than MeanFlow, but because it lacked the identity transformations and the rigorous theoretical support brought by ReFlow, it appeared more like an empirical product of a transition period.
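
In the same style, a JAX sketch of this Shortcut regularization term might look as follows (hypothetical u_model; the full Shortcut training recipe, e.g. how times and step sizes are sampled, is not reproduced here):

import jax, jax.numpy as jnp

def shortcut_reg(params, xt, r, t):
    u = lambda xt, r, t: u_model(params, xt, r, t)   # hypothetical average-velocity network
    s = (r + t) / 2
    u_big = u(xt, s, t)                              # average velocity over [s, t]
    xs = xt - (t - s) * u_big                        # half step taken with the model itself
    target = 0.5 * (u_big + u(xs, r, s))             # composition of the two half-interval averages
    return jnp.mean((u(xt, r, t) - jax.lax.stop_gradient(target)) ** 2)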

Consistency Models

Finally, let's discuss Consistency Models. Since CM and sCM have paved the way, the success of MeanFlow actually draws on their experience, especially the operation of adding $\sg{\cdot}$ to the JVP, which is mentioned in the original paper. Of course, one of the authors of MeanFlow, Kaiming He, is himself a master of manipulating gradients (e.g., SimSiam), so the emergence of MeanFlow feels very much like a natural progression.

We carefully analyzed discrete CM in "Generative Diffusion Model Chat (28): Understanding Consistency Models Step by Step". If we replace the EMA operator in CM with stop_gradient, take the gradient, and take the limit $\Delta t \to 0$, we get the objective function of sCM in "Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models": \begin{equation}\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\cdot \frac{d}{dt}\boldsymbol{f}_{\sg{\boldsymbol{\theta}}}(\boldsymbol{x}_t, t) = \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\cdot\sg{\frac{d\boldsymbol{x}_t}{dt}\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) + \frac{\partial}{\partial t}\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)}\label{eq:loss-scm}\end{equation} If we replace $\frac{d\boldsymbol{x}_t}{dt}$ with $\boldsymbol{x}_1 - \boldsymbol{x}_0$ and denote $\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) = \boldsymbol{x}_t - t\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t , 0, t)$, then its gradient is equivalent to the third objective $\eqref{eq:loss-3}$ of MeanFlow when $r=0$: \begin{equation}\begin{aligned} \nabla_{\boldsymbol{\theta}}\eqref{eq:loss-scm} =&\, \nabla_{\boldsymbol{\theta}}\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\cdot \left[\frac{d\boldsymbol{x}_t}{dt}\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) + \frac{\partial}{\partial t}\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right] \\[10pt] =&\, -t\nabla_{\boldsymbol{\theta}}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t)\cdot \left[\frac{d\boldsymbol{x}_t}{dt} - t\frac{d\boldsymbol{x}_t}{dt}\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t) - \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t) - t\frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t)\right] \\[10pt] =&\, t\nabla_{\boldsymbol{\theta}}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t)\cdot \left[\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t) + t\left[\frac{d\boldsymbol{x}_t}{dt}\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t)\right]- \frac{d\boldsymbol{x}_t}{dt}\right] \\[10pt] =&\, \frac{t}{2}\nabla_{\boldsymbol{\theta}}\left\Vert\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t) + t\sg{\frac{d\boldsymbol{x}_t}{dt}\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t)}- \frac{d\boldsymbol{x}_t}{dt}\right\Vert^2 \\[10pt] \sim &\, \left.\nabla_{\boldsymbol{\theta}}\eqref{eq:loss-3}\right\|_{r=0} \end{aligned}\end{equation}

Therefore, from this perspective, sCM is a special case of MeanFlow when $r=0$. As previously mentioned, the introduction of another time parameter $r$ allows ReFlow to "back up" MeanFlow (when $r=t$), thereby better avoiding training collapse—one of its advantages. Of course, one could also start from sCM and introduce dual time parameters to get the same result as the third objective, but from an aesthetic standpoint, the physical meaning of sCM and CM is ultimately not as intuitive as the interpretation of "average velocity" in MeanFlow.

Furthermore, starting from the average velocity combined with ReFlow also yields the first objective $\eqref{eq:loss-1}$ and the second objective $\eqref{eq:loss-2}$. For a "stop_gradient purist" like me, these are very comfortable and elegant results. In my view, stop_gradient may be added to the loss function for computational reasons, but the first-principles derivation and the basic results should not be coupled with stop_gradient; otherwise, the method becomes strongly coupled with the optimizer and its dynamics, which is not what an essential result should look like.

Article Summary

This article centered on the recently released MeanFlow and discussed ideas for accelerating diffusion model generation from the perspective of "average velocity".