The Path to Optimal Distribution: Minimization in Probability Space

By 苏剑林 | August 6, 2024

When finding the minimum value of a function, we typically find the derivative of the function and then look for its zero points. In fortunate cases, one of these zeros is the minimum point of the original function. For vector functions, we change the derivative to a gradient and find its zero points. When it is difficult to find the zero points of the gradient, we can use gradient descent to gradually approach the minimum point.

These are basic results of unconstrained optimization, which many readers are likely familiar with. However, the theme of this article is optimization in probability space—that is, the input to the objective function is a probability distribution. Optimization for this type of objective is more complex because the search space is no longer the whole of $\mathbb{R}^n$. If we still solve for zero gradients or perform standard gradient descent, the results are not guaranteed to remain probability distributions. Therefore, we need a new method of analysis and calculation to ensure that the optimization results stay within the space of probability distributions.

Regarding this, I have always found it quite troublesome. Recently, "learning from painful experience," I decided to systematically study optimization problems in probability distributions. Finally, I have organized my learning results here for everyone's reference.

Gradient Descent

Let's first revisit basic unconstrained optimization. First, assume our goal is: \begin{equation}\boldsymbol{x}_* = \mathop{\text{argmin}}_{\boldsymbol{x}\in\mathbb{R}^n} F(\boldsymbol{x})\end{equation} Even high school students know that to find the extremum of a function, one often takes the derivative and sets it to zero to find the stationary points; for many, this has become "common sense." But let me test the readers: how many can prove this conclusion? In other words, why is the extremum of a function related to "the derivative being zero"?

Search Perspective

We can explore this problem from a search perspective. Suppose the $\boldsymbol{x}$ we currently know is $\boldsymbol{x}_t$. How do we determine whether $\boldsymbol{x}_t$ is the minimum point? We can think about this in reverse: if we can find $\boldsymbol{x}_{t+\eta}$ such that $F(\boldsymbol{x}_{t+\eta}) < F(\boldsymbol{x}_t)$, then $\boldsymbol{x}_t$ naturally cannot be the minimum point. To this end, we can search for $\boldsymbol{x}_{t+\eta}$ in the following form: \begin{equation}\boldsymbol{x}_{t+\eta} = \boldsymbol{x}_t + \eta \boldsymbol{u}_t,\quad 0 < \eta \ll 1\end{equation} When $F(\boldsymbol{x})$ is sufficiently smooth and $\eta$ is small enough, the first-order approximation is accurate enough: \begin{equation}F(\boldsymbol{x}_{t+\eta}) = F(\boldsymbol{x}_t + \eta \boldsymbol{u}_t) \approx F(\boldsymbol{x}_t) + \eta \boldsymbol{u}_t \cdot \nabla_{\boldsymbol{x}_t}F(\boldsymbol{x}_t)\end{equation} As long as $\nabla_{\boldsymbol{x}_t}F(\boldsymbol{x}_t)\neq 0$, we can choose $\boldsymbol{u}_t = -\nabla_{\boldsymbol{x}_t}F(\boldsymbol{x}_t)$, making: \begin{equation}F(\boldsymbol{x}_{t+\eta}) \approx F(\boldsymbol{x}_t) - \eta \Vert\nabla_{\boldsymbol{x}_t}F(\boldsymbol{x}_t)\Vert^2 < F(\boldsymbol{x}_t)\end{equation} This means that for a sufficiently smooth function, its minimum can only be reached at points satisfying $\nabla_{\boldsymbol{x}_t}F(\boldsymbol{x}_t) = 0$ or at infinity. This is why the first step in finding the minimum is usually "setting the derivative to zero." If $\nabla_{\boldsymbol{x}_t}F(\boldsymbol{x}_t) \neq 0$, we can always choose a sufficiently small $\eta$ to obtain a point that makes $F$ smaller via: \begin{equation}\boldsymbol{x}_{t+\eta} = \boldsymbol{x}_t-\eta\nabla_{\boldsymbol{x}_t}F(\boldsymbol{x}_t)\label{eq:gd}\end{equation} This is gradient descent. If we let $\eta\to 0$, we obtain the ODE: \begin{equation}\frac{d\boldsymbol{x}_t}{dt} = -\nabla_{\boldsymbol{x}_t}F(\boldsymbol{x}_t)\end{equation} This is the "Gradient Flow" introduced in "Gradient Flow: Exploring the Path to the Minimum", which can be seen as the trajectory of our search for the minimum using gradient descent.
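As a minimal numerical sketch of iteration $\eqref{eq:gd}$ (the quartic objective, starting point, and step size below are illustrative choices of mine, not anything prescribed by the text):

```python
import numpy as np

def F(x):
    # an illustrative smooth objective: ||x - 1||^2 + 0.1 * sum(x^4)
    return np.sum((x - 1.0) ** 2) + 0.1 * np.sum(x ** 4)

def grad_F(x):
    return 2.0 * (x - 1.0) + 0.4 * x ** 3

x, eta = np.zeros(3), 0.1
for _ in range(200):
    x = x - eta * grad_F(x)            # x_{t+eta} = x_t - eta * grad F(x_t)

print(x, np.linalg.norm(grad_F(x)))    # the gradient norm ends up close to zero
```

Running it shows the iterates settling at a stationary point, consistent with the "set the derivative to zero" picture above.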

Projected Descent

What we just talked about was unconstrained optimization. Now let's briefly discuss a simple generalization of gradient descent in constrained optimization. Suppose the problem we face is: \begin{equation}\boldsymbol{x}_* = \mathop{\text{argmin}}_{\boldsymbol{x}\in\mathbb{X}} F(\boldsymbol{x})\label{eq:c-loss}\end{equation} where $\mathbb{X}$ is a subset of $\mathbb{R}^n$. If doing theoretical analysis, $\mathbb{X}$ is usually required to be a "bounded convex set," but for a simple understanding, we can ignore these details for now.

If we still use gradient descent $\eqref{eq:gd}$ at this time, the biggest problem is that it cannot be guaranteed that $\boldsymbol{x}_{t+\eta}\in\mathbb{X}$. However, we can add a projection operation: \begin{equation}\Pi_{\mathbb{X}} (\boldsymbol{y}) = \mathop{\text{argmin}}_{\boldsymbol{x}\in\mathbb{X}}\Vert\boldsymbol{x}-\boldsymbol{y}\Vert\label{eq:project}\end{equation} forming "Projected Gradient Descent": \begin{equation}\boldsymbol{x}_{t+\eta} = \Pi_{\mathbb{X}}(\boldsymbol{x}_t-\eta\nabla_{\boldsymbol{x}_t}F(\boldsymbol{x}_t))\label{eq:pgd}\end{equation} Briefly put, projected gradient descent first performs a gradient-descent step and then outputs the point in $\mathbb{X}$ closest to the result of that step, which ensures the result always stays inside $\mathbb{X}$. In "Scientific Alchemy (1): Average Loss Convergence of SGD", we proved that under certain assumptions, projected gradient descent can find the optimal solution to the constrained optimization problem $\eqref{eq:c-loss}$.
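A minimal sketch of projected gradient descent $\eqref{eq:pgd}$. Here $\mathbb{X}$ is taken to be the unit ball, simply because its projection $\eqref{eq:project}$ has a closed form; the objective and step size are likewise my own illustrative choices:

```python
import numpy as np

def project_ball(y, r=1.0):
    # Euclidean projection onto X = {x : ||x|| <= r}, available in closed form
    norm = np.linalg.norm(y)
    return y if norm <= r else (r / norm) * y

c = np.array([2.0, 0.0])               # F(x) = ||x - c||^2; unconstrained minimum lies outside X
grad_F = lambda x: 2.0 * (x - c)

x, eta = np.zeros(2), 0.1
for _ in range(200):
    x = project_ball(x - eta * grad_F(x))   # gradient step, then projection back into X

print(x)                                # approaches the boundary point (1, 0)
```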

From the results, projected gradient descent transforms the constrained optimization $\eqref{eq:c-loss}$ into two steps: "gradient descent + projection." The projection $\eqref{eq:project}$ is itself a constrained optimization problem. Although its objective has a fixed form, it still has to be solved, and the solution depends on the specific $\mathbb{X}$, so further case-by-case analysis is needed.

Discrete Distributions

This article focuses on optimization in probability spaces, meaning the object of search must be a probability distribution. In this section, we first focus on discrete distributions. We denote the search space as $\Delta^{n-1}$, which is the set of all $n$-dimensional discrete probability distributions: \begin{equation}\Delta^{n-1} = \left\{\boldsymbol{p}=(p_1,p_2,\cdots,p_n)\left|\, p_1,p_2,\cdots,p_n\geq 0,\sum_{i=1}^n p_i = 1\right.\right\}\end{equation} Our optimization goal is: \begin{equation}\boldsymbol{p}_* = \mathop{\text{argmin}}_{\boldsymbol{p}\in\Delta^{n-1}} F(\boldsymbol{p})\label{eq:p-loss}\end{equation}

Lagrange Multipliers

For optimization problems under equality or inequality constraints, the standard method is usually the "Lagrange Multiplier Method," which transforms the constrained optimization problem $\eqref{eq:p-loss}$ into a weakly constrained $\text{min-max}$ problem: \begin{equation}\min_{\boldsymbol{p}\in\Delta^{n-1}} F(\boldsymbol{p}) = \min_{\boldsymbol{p}\in\mathbb{R}^n} \max_{\mu_i \geq 0,\lambda\in\mathbb{R}}F(\boldsymbol{p}) - \sum_{i=1}^n \mu_i p_i + \lambda\left(\sum_{i=1}^n p_i - 1\right)\label{eq:min-max}\end{equation} Note that in this $\text{min-max}$ optimization, we removed the constraint $\boldsymbol{p}\in\Delta^{n-1}$, and there is only a relatively simple constraint $\mu_i \geq 0$ in the $\max$ step. How do we prove that the optimization on the right is equivalent to the one on the left? It's not difficult and can be understood in three steps:

1. First, we need to understand the meaning of the $\text{min-max}$ on the right: $\min$ is on the left, and $\max$ is on the right. This means we ultimately want an overall result as small as possible, but this objective function must first be maximized over certain variables;

2. If $p_i < 0$, then the $\max$ step inevitably leads to $\mu_i\to\infty$. At this time, the value of the objective function is $\infty$. If $p_i \geq 0$, then the $\max$ step results in $\mu_i p_i = 0$, and the function value is finite. Clearly, the latter is smaller. Therefore, when the right side takes the optimal value, $p_i\geq 0$ holds. Similarly, we can prove that $\sum_{i=1}^n p_i = 1$ holds;

3. From the analysis in step 2, we know that when the right side takes the optimal value, it must satisfy $\boldsymbol{p}\in\Delta^{n-1}$ and the extra terms are zero, which is equivalent to the optimization problem on the left.

Next, we use a "Minimax Theorem":

If $\mathbb{X}, \mathbb{Y}$ are two convex sets, $\boldsymbol{x}\in\mathbb{X}, \boldsymbol{y}\in\mathbb{Y}$, and $f(\boldsymbol{x}, \boldsymbol{y})$ is convex with respect to $\boldsymbol{x}$ (for any fixed $\boldsymbol{y}$) and concave with respect to $\boldsymbol{y}$ (for any fixed $\boldsymbol{x}$), then: \begin{equation}\min_{\boldsymbol{x}\in\mathbb{X}}\max_{\boldsymbol{y}\in\mathbb{Y}} f(\boldsymbol{x},\boldsymbol{y}) = \max_{\boldsymbol{y}\in\mathbb{Y}}\min_{\boldsymbol{x}\in\mathbb{X}} f(\boldsymbol{x},\boldsymbol{y})\end{equation}

The Minimax Theorem provides a sufficient condition for swapping $\min$ and $\max$. A "convex set" refers to a set where the weighted average of any two points in the set is still in the set, i.e., \begin{equation}(1-\lambda)\boldsymbol{x}_1 + \lambda \boldsymbol{x}_2\in \mathbb{X},\qquad\forall \boldsymbol{x}_1,\boldsymbol{x}_2\in \mathbb{X},\quad\forall \lambda\in [0, 1]\end{equation} It can be seen that the condition for a convex set is not too harsh; $\mathbb{R}^n, \Delta^{n-1}$, and the set of all non-negative numbers are all convex sets.

For the objective function on the right side of formula $\eqref{eq:min-max}$, it is a linear function with respect to $\mu_i, \lambda$, so it satisfies the condition of being concave with respect to $\mu_i, \lambda$. Except for $F(\boldsymbol{p})$, the other terms are also linear with respect to $\boldsymbol{p}$, so the convexity of the entire objective function with respect to $\boldsymbol{p}$ is equivalent to the convexity of $F(\boldsymbol{p})$ with respect to $\boldsymbol{p}$. That is, if $F(\boldsymbol{p})$ is a convex function with respect to $\boldsymbol{p}$, then the $\min$ and $\max$ of formula $\eqref{eq:min-max}$ can be swapped: \begin{equation}\small\min_{\boldsymbol{p}\in\mathbb{R}^n} \max_{\mu_i \geq 0,\lambda\in\mathbb{R}}F(\boldsymbol{p}) - \sum_{i=1}^n \mu_i p_i + \lambda\left(\sum_{i=1}^n p_i - 1\right) = \max_{\mu_i \geq 0,\lambda\in\mathbb{R}} \min_{\boldsymbol{p}\in\mathbb{R}^n} F(\boldsymbol{p}) - \sum_{i=1}^n \mu_i p_i + \lambda\left(\sum_{i=1}^n p_i - 1\right)\end{equation} In this way, we can first minimize over $\boldsymbol{p}$. This is an unconstrained minimization problem, which can be completed by solving a system of equations where the gradient equals zero. The result will contain parameters $\lambda$ and $\mu_i$, which are finally determined by $p_i \geq 0$, $\mu_i p_i = 0$, and $\sum_{i=1}^n p_i = 1$.
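As a concrete illustration of this procedure (the specific objective below is chosen only for the example), take \begin{equation}F(\boldsymbol{p}) = \sum_{i=1}^n p_i\log p_i - \sum_{i=1}^n x_i p_i\end{equation} which is convex in $\boldsymbol{p}$. Setting the gradient of the inner objective with respect to $p_i$ to zero gives $\log p_i + 1 - x_i - \mu_i + \lambda = 0$, i.e., $p_i = e^{x_i + \mu_i - \lambda - 1}$. Since this is always positive, $\mu_i p_i = 0$ forces $\mu_i = 0$, and $\sum_{i=1}^n p_i = 1$ then fixes $\lambda$, giving $p_i = e^{x_i}\big/\sum_{j=1}^n e^{x_j}$, which is exactly Softmax.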

Convex Set Search

However, although the Lagrange multiplier method is regarded as the standard method for solving constrained optimization problems, it is not intuitive. Moreover, it can only obtain exact solutions by solving equations and cannot derive iterative approximation algorithms similar to gradient descent. Therefore, we cannot be satisfied only with the Lagrange multiplier method.

From a search perspective, the key to solving optimization problems in probability space is to ensure that the test points are within the set $\Delta^{n-1}$ during the search process. In other words, assuming the current probability distribution is $\boldsymbol{p}_t\in \Delta^{n-1}$, how do we construct the next trial point $\boldsymbol{p}_{t+\eta}$? It has two requirements: first, $\boldsymbol{p}_{t+\eta}\in \Delta^{n-1}$, and second, the proximity of $\boldsymbol{p}_{t+\eta}$ to $\boldsymbol{p}_t$ can be controlled by the size of $\eta$. At this time, the "convex set" nature of $\Delta^{n-1}$ comes in handy. Using this property, we can define $\boldsymbol{p}_{t+\eta}$ as: \begin{equation}\boldsymbol{p}_{t+\eta} = (1-\eta)\boldsymbol{p}_t + \eta \boldsymbol{q}_t,\quad \boldsymbol{q}_t\in \Delta^{n-1}\end{equation} Then we have: \begin{equation}F(\boldsymbol{p}_{t+\eta}) = F((1-\eta)\boldsymbol{p}_t + \eta \boldsymbol{q}_t) \approx F(\boldsymbol{p}_t) + \eta(\boldsymbol{q}_t - \boldsymbol{p}_t)\cdot\nabla_{\boldsymbol{p}_t} F(\boldsymbol{p}_t)\end{equation} Assuming the accuracy of the first-order approximation is sufficient, to obtain the direction of fastest descent, it is equivalent to solving: \begin{equation}\mathop{\text{argmin}}_{\boldsymbol{q}_t\in\Delta^{n-1}}\,\boldsymbol{q}_t\cdot\nabla_{\boldsymbol{p}_t} F(\boldsymbol{p}_t) \end{equation} This objective function is very simple. The answer is: \begin{equation}\boldsymbol{q}_t = \text{onehot}(\text{argmin}(\nabla_{\boldsymbol{p}_t} F(\boldsymbol{p}_t)))\end{equation} Here $\nabla_{\boldsymbol{p}_t} F(\boldsymbol{p}_t)$ is a vector, and performing $\text{argmin}$ on a vector refers to finding the position of the smallest component. Therefore, the above formula means that $\boldsymbol{q}_t$ is a one-hot distribution, where the position of the $1$ is the position of the smallest component of the gradient $\nabla_{\boldsymbol{p}_t} F(\boldsymbol{p}_t)$.

From this, we see that the gradient descent form for probability spaces is: \begin{equation}\boldsymbol{p}_{t+\eta} = (1 - \eta)\boldsymbol{p}_t + \eta\, \text{onehot}(\text{argmin}(\nabla_{\boldsymbol{p}_t} F(\boldsymbol{p}_t)))\end{equation} And the condition for $\boldsymbol{p}_t$ to be the local minimum of $F(\boldsymbol{p}_t)$ is: \begin{equation}\boldsymbol{p}_t\cdot\nabla_{\boldsymbol{p}_t} F(\boldsymbol{p}_t) = (\nabla_{\boldsymbol{p}_t} F(\boldsymbol{p}_t))_{\min}\label{eq:p-min}\end{equation} Here, the $\min$ of a vector refers to returning the smallest component.
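The update above is recognizable as the Frank-Wolfe (conditional gradient) method specialized to the simplex. Below is a minimal numerical sketch; the objective $F(\boldsymbol{p})=\Vert\boldsymbol{p}-\boldsymbol{x}\Vert^2$ anticipates the Sparsemax example of the next section, and the decaying step size $\eta_t = 2/(t+2)$ is a standard but otherwise arbitrary choice of mine:

```python
import numpy as np

x = np.array([0.5, 2.0, 1.1, -1.0, 0.3])    # parameters of F(p) = ||p - x||^2
grad_F = lambda p: 2.0 * (p - x)

p = np.ones(len(x)) / len(x)                # start from the uniform distribution
for t in range(1, 5001):
    g = grad_F(p)
    q = np.zeros_like(p)
    q[np.argmin(g)] = 1.0                   # onehot(argmin(grad))
    eta = 2.0 / (t + 2)                     # decaying step size (illustrative choice)
    p = (1 - eta) * p + eta * q             # convex combination stays in the simplex

print(p, p.sum())                           # drifts toward (0, 0.95, 0.05, 0, 0); the sum stays 1
```

Because every trial point is a convex combination of points in $\Delta^{n-1}$, no projection step is ever needed.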

An Example

Taking the Sparsemax introduced in "The Path to Probability Distribution: Reviewing Softmax and its Alternatives" as an example, its original definition is: \begin{equation}\text{Sparsemax}(\boldsymbol{x}) = \mathop{\text{argmin}}\limits_{\boldsymbol{p}\in\Delta^{n-1}}\Vert \boldsymbol{p} - \boldsymbol{x}\Vert^2\end{equation} where $\boldsymbol{x}\in\mathbb{R}^n$. It is not difficult to see that, from the perspective of projected gradient descent mentioned earlier, Sparsemax is exactly the "projection" operation from $\mathbb{R}^n$ to $\Delta^{n-1}$.

We denote $F(\boldsymbol{p})=\Vert \boldsymbol{p} - \boldsymbol{x}\Vert^2$. Its gradient with respect to $\boldsymbol{p}$ is $2(\boldsymbol{p} - \boldsymbol{x})$. Therefore, according to formula $\eqref{eq:p-min}$, the equation satisfied by the minimum point is: \begin{equation}\boldsymbol{p}\cdot(\boldsymbol{p}-\boldsymbol{x}) = (\boldsymbol{p}-\boldsymbol{x})_{\min}\end{equation} We adopt the convention that $x_i = x_j\Leftrightarrow p_i = p_j$. Here, a non-bold subscript such as $p_i$ denotes the $i$-th component of the vector $\boldsymbol{p}$ (a scalar), while a bold subscript such as $\boldsymbol{p}_t$ in the previous section denotes the $t$-th iterate (a vector). Please distinguish carefully.

Under this convention, the above formula yields: \begin{equation}p_i > 0 \quad \Leftrightarrow \quad p_i-x_i = (\boldsymbol{p}-\boldsymbol{x})_{\min}\end{equation} Since $\boldsymbol{p}$ is determined by $\boldsymbol{x}$, $(\boldsymbol{p}-\boldsymbol{x})_{\min}$ is a function of $\boldsymbol{x}$, which we denote as $-\lambda(\boldsymbol{x})$. Then $p_i = x_i - \lambda(\boldsymbol{x})$, but this only holds for $p_i > 0$. For $p_i=0$, we have $p_i-x_i > (\boldsymbol{p}-\boldsymbol{x})_{\min}$, i.e., $x_i - \lambda(\boldsymbol{x}) < 0$. Combining these two cases, we can write uniformly: \begin{equation}p_i = \text{relu}(x_i - \lambda(\boldsymbol{x}))\end{equation} where $\lambda(\boldsymbol{x})$ is determined by the requirement that the components of $\boldsymbol{p}$ sum to 1. For other details, please refer to "The Path to Probability Distribution: Reviewing Softmax and its Alternatives".
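For completeness, here is a small numerical sketch of this result; determining $\lambda(\boldsymbol{x})$ by bisection on the monotone function $\sum_i \text{relu}(x_i - \lambda)$ is just one convenient choice of mine, with the tolerance controlled by the number of iterations:

```python
import numpy as np

def sparsemax(x, iters=60):
    # Find lambda(x) such that sum_i relu(x_i - lambda) = 1 by bisection
    # (the sum is decreasing in lambda), then return p_i = relu(x_i - lambda).
    lo, hi = np.min(x) - 1.0, np.max(x)      # sum >= 1 at lo, sum = 0 at hi
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if np.sum(np.maximum(x - lam, 0.0)) > 1.0:
            lo = lam                         # sum too large -> lambda must increase
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return np.maximum(x - lam, 0.0)

print(sparsemax(np.array([0.5, 2.0, 1.1, -1.0, 0.3])))   # ~ (0, 0.95, 0.05, 0, 0)
```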

Continuous Distributions

Having talked about discrete distributions, we now turn to continuous distributions. It seems that continuous distributions are just the limit version of discrete ones, and the results shouldn't be much different. But in fact, their characteristics are essentially different, such that we need to build a whole new methodology for continuous distributions.

Objective Functional

First, let's talk about the objective function. We know that the way to describe continuous distributions is the probability density function. So the input to the objective function at this time is a probability density function. In fact, the objective function at this time is no longer an ordinary function; we usually call it a "functional"—a mapping from an entire function to a scalar. In other words, we need to find a probability density function that minimizes a certain objective functional.

Although many people feel that "functional analysis strikes fear into the heart" (泛函分析心犯寒), most of us have in fact already encountered functionals, because there are plenty of mappings that "take a function as input and output a scalar," such as the definite integral: \begin{equation}\mathcal{I}[f]\triangleq \int_a^b f(x) dx\end{equation} It is a mapping from a function to a scalar, so it is also a functional. In fact, the functionals we encounter in practical applications are basically constructed from definite integrals, such as the KL divergence between probability distributions: \begin{equation}\mathcal{KL}[p\Vert q] = \int p(\boldsymbol{x})\log \frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}d\boldsymbol{x}\end{equation} where the integral is by default over the entire space (the whole $\mathbb{R}^n$). More general functionals may also contain derivative terms in the integrand, such as the action functional in theoretical physics (the one appearing in the principle of least action): \begin{equation}\mathcal{A}[x] = \int_{t_a}^{t_b} L(x(t),x'(t),t)dt\end{equation}

The objective functional we want to minimize next can be generally written as: \begin{equation}\mathcal{F}[p] = \int F(p(\boldsymbol{x}))d\boldsymbol{x}\end{equation} For convenience, we also define the functional derivative: \begin{equation}\frac{\delta\mathcal{F}[p]}{\delta p}(\boldsymbol{x}) = \frac{\partial F(p(\boldsymbol{x}))}{\partial p(\boldsymbol{x})}\end{equation}
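As a quick check of this definition, take the KL divergence above with $q$ fixed, i.e., $F(p(\boldsymbol{x})) = p(\boldsymbol{x})\log\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}$; differentiating with respect to $p(\boldsymbol{x})$ gives \begin{equation}\frac{\delta\mathcal{KL}[p\Vert q]}{\delta p}(\boldsymbol{x}) = \log\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})} + 1\end{equation}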

Compact Support Set

In addition, we need a notation for the continuous probability space, the basic definition of which is: \begin{equation}\mathbb{P} = \left\{p(\boldsymbol{x}) \,\Bigg|\, p(\boldsymbol{x})\geq 0\,(\forall\boldsymbol{x}\in\mathbb{R}^n),\,\int p(\boldsymbol{x})d\boldsymbol{x} = 1\right\}\end{equation} It is not difficult to prove that if the limit $\lim_{\Vert\boldsymbol{x}\Vert\to\infty} p(\boldsymbol{x})$ of the probability density function $p(\boldsymbol{x})$ exists, then it must be $\lim_{\Vert\boldsymbol{x}\Vert\to\infty} p(\boldsymbol{x}) = 0$. This is also a property to be used in later proofs.

However, one can construct probability density functions that have no limit at infinity at all. To avoid theoretical difficulties, in theoretical proofs we usually assume that the support set of $p(\boldsymbol{x})$ is a compact set. There are two concepts here: Support and Compact Set. The support set refers to the set of all $\boldsymbol{x}$ such that $p(\boldsymbol{x}) > 0$, i.e., \begin{equation}\text{supp}(p) = \{\boldsymbol{x} \,|\, p(\boldsymbol{x}) > 0\}\end{equation} The general definition of a compact set is relatively involved, but in $\mathbb{R}^n$ a compact set is equivalent to a bounded closed set. So, simply put, the assumption that the support set of $p(\boldsymbol{x})$ is compact directly gives $p(\boldsymbol{x})$ the property "there exists a constant $C$ such that $p(\boldsymbol{x}) = 0$ for all $\Vert\boldsymbol{x}\Vert > C$", simplifying the behavior of $p(\boldsymbol{x})$ at infinity and sidestepping the discussion of $\lim_{\Vert\boldsymbol{x}\Vert\to\infty} p(\boldsymbol{x}) = 0$ entirely.

From a theoretical point of view, this assumption is very strong; it even excludes simple distributions like the normal distribution (whose support set is all of $\mathbb{R}^n$). In practice, however, the assumption is not unreasonable. As noted above, if the limit $\lim_{\Vert\boldsymbol{x}\Vert\to\infty} p(\boldsymbol{x})$ exists, it must be zero, so beyond a certain range the density is practically indistinguishable from zero. There are indeed examples where the limit does not exist, but they generally have to be constructed quite deliberately; the data we encounter in practice essentially always satisfy the condition that the limit exists.

The Old Path Is Impassable

Intuitively, the optimization of continuous distributions should follow the same logic as discrete distributions, that is, let $p_{t+\eta}(\boldsymbol{x}) = (1 - \eta)p_t(\boldsymbol{x}) + \eta q_t(\boldsymbol{x})$, because, like the discrete case, the set of probability density functions $\mathbb{P}$ is also a convex set. Now we substitute it into the objective functional: \begin{equation}\begin{aligned} \mathcal{F}[p_{t+\eta}] =&\, \int F(p_{t+\eta}(\boldsymbol{x}))d\boldsymbol{x} \\ =&\, \int F((1 - \eta)p_t(\boldsymbol{x}) + \eta q_t(\boldsymbol{x}))d\boldsymbol{x} \\ \approx&\,\int \left[F(p_t(\boldsymbol{x})) + \eta\frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})}\Big(q_t(\boldsymbol{x}) - p_t(\boldsymbol{x})\Big)\right]d\boldsymbol{x} \end{aligned}\end{equation} Assuming the first-order approximation is sufficient, the problem is transformed into: \begin{equation}\mathop{\text{argmin}}_{q_t\in \mathbb{P}}\int\frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})}q_t(\boldsymbol{x})d\boldsymbol{x}\end{equation} This problem is not difficult to solve. The answer is analogous to the one-hot answer in the discrete case: \begin{equation}q_t(\boldsymbol{x}) = \delta\left(\boldsymbol{x} - \mathop{\text{argmin}}_{\boldsymbol{x}'} \frac{\partial F(p_t(\boldsymbol{x}'))}{\partial p_t(\boldsymbol{x}')}\right)\end{equation} The $\delta(\cdot)$ here is the Dirac delta function, which represents the probability density of a single-point distribution.

It seems very smooth, but in reality, this path is impassable. First, the Dirac delta function is not a function in the conventional sense; it is a generalized function (also a type of functional). Second, if we look at it from the perspective of an ordinary function, the Dirac delta function has an infinitely large value at a certain point. Since it is an infinitely large value, the assumption that "the first-order approximation is sufficient" in the derivation process cannot hold.

Variable Substitution

We could consider continuing to patch the derivation in the previous section, for example by adding a constraint such as $q_t(\boldsymbol{x}) \leq C$ to obtain a meaningful result. However, such patching ultimately feels inelegant. But if we do not make use of the convexity of the set, how else can we construct the next trial distribution $p_{t+\eta}(\boldsymbol{x})$?

At this time, we must give full play to the characteristics of the probability density function—we can transform one probability density function into another through variable substitution. This is a unique property of continuous distributions. Specifically, if $p(\boldsymbol{x})$ is a probability density function and $\boldsymbol{y}=\boldsymbol{T}(\boldsymbol{x})$ is an invertible transformation, then $p(\boldsymbol{T}(\boldsymbol{x}))\left\|\frac{\partial \boldsymbol{T}(\boldsymbol{x})}{\partial\boldsymbol{x}}\right\|$ is also a probability density function, where $\|\cdot\|$ represents the absolute value of the determinant of the matrix.

Based on this property, we define the next probability distribution to be tested as: \begin{equation}\begin{aligned} p_{t+\eta}(\boldsymbol{x}) =&\, p_t(\boldsymbol{x} + \eta\boldsymbol{\mu}_t(\boldsymbol{x}))\left\|\boldsymbol{I} + \eta\frac{\partial \boldsymbol{\mu}_t(\boldsymbol{x})}{\partial\boldsymbol{x}}\right\| \\ \approx &\, \Big[p_t(\boldsymbol{x}) + \eta\boldsymbol{\mu}_t(\boldsymbol{x})\cdot\nabla_{\boldsymbol{x}} p_t(\boldsymbol{x})\Big]\left[1 + \eta\,\text{Tr}\frac{\partial \boldsymbol{\mu}_t(\boldsymbol{x})}{\partial\boldsymbol{x}}\right] \\[3pt] \approx &\, p_t(\boldsymbol{x}) + \eta\boldsymbol{\mu}_t(\boldsymbol{x})\cdot\nabla_{\boldsymbol{x}} p_t(\boldsymbol{x}) + \eta\, p_t(\boldsymbol{x})\nabla_{\boldsymbol{x}}\cdot\boldsymbol{\mu}_t(\boldsymbol{x}) \\[5pt] = &\, p_t(\boldsymbol{x}) + \eta\nabla_{\boldsymbol{x}}\cdot\big[p_t(\boldsymbol{x})\boldsymbol{\mu}_t(\boldsymbol{x})\big] \\ \end{aligned}\end{equation} The same result was derived in "Diffusion Models (12): 'Hard-core' Diffusion ODE". For the approximate expansion of the determinant, please refer to the article "Derivative of Determinants".
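A quick one-dimensional numerical check of this first-order expansion; the Gaussian density and the velocity field $\mu_t(x)=\sin x$ are arbitrary illustrative choices:

```python
import numpy as np

p  = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)   # illustrative density p_t
dp = lambda x: -x * p(x)                                   # its derivative
mu, dmu = np.sin, np.cos                                   # illustrative velocity field mu_t

eta, x0 = 1e-3, 0.7
exact  = p(x0 + eta * mu(x0)) * abs(1 + eta * dmu(x0))     # p_t(x + eta*mu) * |1 + eta*mu'|
approx = p(x0) + eta * (dp(x0) * mu(x0) + p(x0) * dmu(x0)) # p_t + eta * d/dx[p_t * mu]
print(exact, approx, abs(exact - approx))                  # the discrepancy is O(eta^2)
```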

Integral Transformation

Using this new $p_{t+\eta}(\boldsymbol{x})$, we can get: \begin{equation}\begin{aligned} \mathcal{F}[p_{t+\eta}] =&\, \int F(p_{t+\eta}(\boldsymbol{x}))d\boldsymbol{x} \\ \approx&\, \int F\Big(p_t(\boldsymbol{x}) + \eta\nabla_{\boldsymbol{x}}\cdot\big[p_t(\boldsymbol{x})\boldsymbol{\mu}_t(\boldsymbol{x})\big]\Big)d\boldsymbol{x} \\ \approx&\, \int \left[F(p_t(\boldsymbol{x})) + \eta\frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})}\nabla_{\boldsymbol{x}}\cdot\big[p_t(\boldsymbol{x})\boldsymbol{\mu}_t(\boldsymbol{x})\big]\right]d\boldsymbol{x} \\ =&\, \mathcal{F}[p_t] + \eta\int \frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})}\nabla_{\boldsymbol{x}}\cdot\big[p_t(\boldsymbol{x})\boldsymbol{\mu}_t(\boldsymbol{x})\big] d\boldsymbol{x} \\ \end{aligned}\label{eq:px-approx}\end{equation} Next, we need to derive an integral identity related to the probability density, as done in "Deriving the Continuity Equation and Fokker-Planck Equation using Test Functions". First, we have: \begin{equation}\begin{aligned} &\,\int \nabla_{\boldsymbol{x}}\cdot\left[\frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})} p_t(\boldsymbol{x})\boldsymbol{\mu}_t(\boldsymbol{x})\right] d\boldsymbol{x} \\[5pt] =&\, \int \frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})}\nabla_{\boldsymbol{x}}\cdot\big[p_t(\boldsymbol{x})\boldsymbol{\mu}_t(\boldsymbol{x})\big] d\boldsymbol{x} + \int \left(\nabla_{\boldsymbol{x}}\frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})}\right)\cdot\big[p_t(\boldsymbol{x})\boldsymbol{\mu}_t(\boldsymbol{x})\big] d\boldsymbol{x} \end{aligned}\end{equation} According to the Divergence Theorem, we have: \begin{equation}\int_{\Omega} \nabla_{\boldsymbol{x}}\cdot\left[\frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})} p_t(\boldsymbol{x})\boldsymbol{\mu}_t(\boldsymbol{x})\right] d\boldsymbol{x} = \int_{\partial\Omega} \left[\frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})} p_t(\boldsymbol{x})\boldsymbol{\mu}_t(\boldsymbol{x})\right]\cdot \hat{\boldsymbol{n}} dS\end{equation} where $\Omega$ is the integral region, here the whole $\mathbb{R}^n$, $\partial\Omega$ is the boundary, and the boundary of $\mathbb{R}^n$ is naturally at infinity. $\hat{\boldsymbol{n}}$ is the outward unit normal vector of the boundary, and $dS$ is the area element. Under the assumption of compact support, $p_t(\boldsymbol{x})=0$ at infinity, so the right side of the above formula is actually an integral of zero, and the result is zero. Therefore, we have: \begin{equation}\int \frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})}\nabla_{\boldsymbol{x}}\cdot\big[p_t(\boldsymbol{x})\boldsymbol{\mu}_t(\boldsymbol{x})\big] d\boldsymbol{x} = - \int \left(\nabla_{\boldsymbol{x}}\frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})}\right)\cdot\big[p_t(\boldsymbol{x})\boldsymbol{\mu}_t(\boldsymbol{x})\big] d\boldsymbol{x}\end{equation} Substituting into equation $\eqref{eq:px-approx}$ yields: \begin{equation}\mathcal{F}[p_{t+\eta}] \approx \mathcal{F}[p_t] - \eta\int \left(p_t(\boldsymbol{x})\nabla_{\boldsymbol{x}}\frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})}\right)\cdot \boldsymbol{\mu}_t(\boldsymbol{x}) d\boldsymbol{x} \label{eq:px-approx-2}\end{equation}
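The integration-by-parts identity above can also be checked numerically in one dimension with a compactly supported bump density; the particular $F$, $p_t$, and $\boldsymbol{\mu}_t$ below are arbitrary illustrative choices:

```python
import numpy as np

x  = np.linspace(-1.5, 1.5, 20001)
dx = x[1] - x[0]

p = np.zeros_like(x)                          # compactly supported bump density on (-1, 1)
inside = np.abs(x) < 1
p[inside] = np.exp(-1.0 / (1.0 - x[inside] ** 2))
p /= np.sum(p) * dx                           # normalize so p integrates to 1

mu   = np.sin(3 * x)                          # illustrative velocity field
dFdp = 2.0 * p                                # take F(p) = p^2, so dF/dp = 2p

lhs = np.sum(dFdp * np.gradient(p * mu, dx)) * dx          # int (dF/dp) * d/dx[p*mu] dx
rhs = -np.sum(np.gradient(dFdp, dx) * (p * mu)) * dx       # -int d/dx[dF/dp] * p*mu dx
print(lhs, rhs)                               # the two agree up to discretization error
```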

Gradient Flow

According to equation $\eqref{eq:px-approx-2}$ , a simple choice to make $\mathcal{F}[p_{t+\eta}] \leq \mathcal{F}[p_t]$ is: \begin{equation}\boldsymbol{\mu}_t(\boldsymbol{x}) = \nabla_{\boldsymbol{x}}\frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})}\end{equation} The corresponding iteration format is: \begin{equation}p_{t+\eta}(\boldsymbol{x}) \approx p_t(\boldsymbol{x}) + \eta\nabla_{\boldsymbol{x}}\cdot\left[p_t(\boldsymbol{x})\nabla_{\boldsymbol{x}}\frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})}\right] \end{equation} We can also take the limit $\eta\to 0$ to get: \begin{equation}\frac{\partial}{\partial t}p_t(\boldsymbol{x}) = \nabla_{\boldsymbol{x}}\cdot\left[p_t(\boldsymbol{x})\nabla_{\boldsymbol{x}}\frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})}\right] \end{equation} Or simplified as: \begin{equation}\frac{\partial p_t}{\partial t} = \nabla\cdot\left[p_t\nabla\frac{\delta \mathcal{F}[p_t]}{\delta p_t}\right] \end{equation} This is the Wasserstein gradient flow introduced in "Gradient Flow: Exploring the Path to the Minimum", but we obtained the same result here without introducing the concept of Wasserstein distance.
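A standard sanity check: take the negative entropy $\mathcal{F}[p] = \int p(\boldsymbol{x})\log p(\boldsymbol{x})d\boldsymbol{x}$, so that $\frac{\delta \mathcal{F}[p_t]}{\delta p_t} = \log p_t + 1$ and $\nabla\frac{\delta \mathcal{F}[p_t]}{\delta p_t} = \frac{\nabla p_t}{p_t}$. The flow then reduces to \begin{equation}\frac{\partial p_t}{\partial t} = \nabla\cdot\left[p_t\,\frac{\nabla p_t}{p_t}\right] = \Delta p_t\end{equation} that is, the heat equation, which is the classical statement that the Wasserstein gradient flow of (negative) entropy is heat diffusion.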

Since $p_{t+\eta}(\boldsymbol{x})$ is obtained from $p_t(\boldsymbol{x})$ by the transformation $\boldsymbol{x}\to \boldsymbol{x} + \eta \boldsymbol{\mu}_t(\boldsymbol{x})$, we can also write down the ODE for the motion trajectory of $\boldsymbol{x}$: \begin{equation}\boldsymbol{x}_t = \boldsymbol{x}_{t+\eta} + \eta \boldsymbol{\mu}_t(\boldsymbol{x}_{t+\eta})\quad\Rightarrow\quad \frac{d \boldsymbol{x}_t}{dt} = -\boldsymbol{\mu}_t(\boldsymbol{x}_t) = -\left.\nabla_{\boldsymbol{x}}\frac{\partial F(p_t(\boldsymbol{x}))}{\partial p_t(\boldsymbol{x})}\right|_{\boldsymbol{x}=\boldsymbol{x}_t}\end{equation} The meaning of this ODE is that, starting from a sample $\boldsymbol{x}_0$ of the distribution $p_0(\boldsymbol{x})$ and moving according to this ODE to $\boldsymbol{x}_t$, the distribution followed by $\boldsymbol{x}_t$ is exactly $p_t(\boldsymbol{x})$.

Conclusion

This article systematically summarizes the minimization methods of objective functions in probability space, including the necessary conditions for reaching the minimum and iteration methods similar to gradient descent. Related results are occasionally used in scenarios such as optimization and generative models (especially diffusion models).