By 苏剑林 | September 15, 2020
A huge reason for the immense success of deep learning is that gradient-based optimization algorithms (SGD, Adam, etc.) can effectively solve most neural network models. However, since they are based on gradients, they require the model to be differentiable. As research progresses, we often find ourselves needing to solve non-differentiable models. Typical examples include directly optimizing metrics like accuracy, F1, or BLEU, or incorporating non-differentiable modules (such as "skip-reading" operations) within a neural network.
This article will briefly introduce two effective methods for solving non-differentiable models: Policy Gradient, one of the most important methods in reinforcement learning, and Zeroth-Order Optimization, which requires no gradients at all. On the surface, these two optimization methods seem to have completely different approaches, but this article will further demonstrate that for a large class of optimization problems, the two are essentially equivalent.
Formal Description
First, let us formally define the problem we need to solve. Taking supervised learning as an example, assume training data $(x_t, y_t) \sim \mathcal{D}$, the model is $p_{\theta}(y|x)$, and $\theta$ is the parameter to be optimized, with dimension $d$. Suppose the model itself is differentiable, with the general form $\text{softmax}(f_{\theta}(y|x)/\tau)$, where $\tau$ is the temperature parameter (defaulting to $\tau=1$ unless otherwise specified). If the true label is $y_t$ and the predicted label is $y_p$, then the score for a single sample is denoted $r(y_t, y_p)$. The training objective is to maximize the total score:
\begin{equation}\theta = \mathop{\text{argmax}}_{\theta}\mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\left[r\left(y_t, \mathop{\text{argmax}}_y p_{\theta}(y|x_t)\right)\right]\label{eq:base}\end{equation}
This looks quite complex, but its meaning is actually intuitive and clear: we want to find the parameter $\theta$ such that the score $r(y_t, y_p)$ over the entire dataset is as high as possible, where $y_p = \mathop{\text{argmax}}_y p_{\theta}(y|x_t)$ means the model outputs the $y$ with the highest probability during prediction. Simply put, we want "the $y$ with the highest predicted probability to be the $y$ with the highest evaluation score."
This formulation covers a considerable number of machine learning tasks. In NLP it includes text classification, sequence labeling, text generation, and so on; even regression problems can be cast in this form, so it is quite representative. The fundamental difficulty is that the $\mathop{\text{argmax}}_y$ step provides no useful gradient, making it hard to apply gradient-based optimization algorithms directly.
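To make the difficulty concrete, here is a minimal NumPy sketch (the linear model and toy data are purely illustrative assumptions): taking accuracy as the score $r$, the objective is piecewise constant in $\theta$ because of the $\mathop{\text{argmax}}$, so its gradient is zero almost everywhere and carries no useful information.

```python
import numpy as np

def softmax(z, tau=1.0):
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def accuracy_objective(theta, X, y_true):
    """Score a linear softmax classifier by accuracy: r(y_t, y_p) = 1 if the
    predicted label matches, else 0. The argmax makes this piecewise constant in theta."""
    y_pred = softmax(X @ theta).argmax(axis=-1)
    return float((y_pred == y_true).mean())

# toy data: 100 samples, 5 features, 3 classes (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 3, size=100)
theta = rng.normal(size=(5, 3))

print(accuracy_objective(theta, X, y))
# a tiny perturbation of theta almost never changes the score:
print(accuracy_objective(theta + 1e-6 * rng.normal(size=theta.shape), X, y))
```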
Policy Gradient
The idea behind Policy Gradient is straightforward: since the original objective $\eqref{eq:base}$ cannot be differentiated, we replace it with a strongly related objective that is differentiable, such as:
\begin{equation}\theta = \mathop{\text{argmax}}_{\theta}\mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\left[\sum_y p_{\theta}(y|x_t) r\left(y_t, y\right)\right]\label{eq:policy}\end{equation}
Rearrangement Inequality
Clearly, the objective defined above does not contain operators like $\mathop{\text{argmax}}_y$, so it is differentiable. So, what is the relationship between this equation and the original objective $\eqref{eq:base}$? What are the differences? This can be answered by the "Rearrangement Inequality" in mathematics:
Rearrangement Inequality: For $a_1 \geq a_2 \geq \dots \geq a_n$ and $b_1 \geq b_2 \geq \dots \geq b_n$, and assuming $(c_1, c_2, \dots, c_n)$ is any permutation of $(b_1, b_2, \dots, b_n)$, then:
\begin{equation}\sum_{i=1}^n a_i b_i \geq \sum_{i=1}^n a_i c_i \geq \sum_{i=1}^n a_i b_{n+1-i}\end{equation}
In other words, "the sum of products of similarly ordered sequences ≥ sum of products of mixed sequences ≥ sum of products of oppositely ordered sequences."
The rearrangement inequality is a classic inequality. Its proof (usually via mathematical induction) can be easily found online, so we omit it here. Based on the rearrangement inequality, we know that if the objective $\eqref{eq:policy}$ reaches its maximum, then $p_{\theta}(y|x_t)$ and $r(y_t, y)$ must be ordered similarly. This means we indeed achieve the goal of "the $y$ with the highest predicted probability being the $y$ with the highest score." However, it also simultaneously strives for goals like "the $y$ with the second highest predicted probability being the $y$ with the second highest score," which is not strictly required by the original objective. Therefore, objective $\eqref{eq:policy}$ is strongly related to the original objective but imposes more requirements.
Notably, the rearrangement inequality does not require $a_i, b_i$ to be non-negative, so the scoring function $r(y_t, y)$ may well take negative values. Moreover, adding the same constant to every $r(y_t, y)$ merely shifts objective $\eqref{eq:policy}$ by that constant (since $\sum_y p_{\theta}(y|x_t) = 1$), so it does not change the optimal $\theta$.
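As a quick sanity check of the rearrangement inequality (a throwaway NumPy snippet, not part of the derivation): the similarly ordered sum dominates any shuffled pairing, which in turn dominates the oppositely ordered sum.

```python
import numpy as np

rng = np.random.default_rng(0)
a = -np.sort(-rng.normal(size=10))   # a_1 >= a_2 >= ... >= a_n
b = -np.sort(-rng.normal(size=10))   # b_1 >= b_2 >= ... >= b_n (signs unrestricted)

same_order = a @ b                   # similarly ordered pairing
opposite   = a @ b[::-1]             # oppositely ordered pairing
for _ in range(1000):                # any permutation sits in between
    c = rng.permutation(b)
    assert same_order >= a @ c >= opposite
```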
Sampling Estimated Gradient
Once we have determined that objective $\eqref{eq:policy}$ is feasible, we can calculate its gradient:
\begin{equation}\mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\left[\sum_y \nabla_{\theta} p_{\theta}(y|x_t) r\left(y_t, y\right)\right]\label{eq:policy-grad-base}\end{equation}
In general, calculating the gradient $\nabla_{\theta} p_{\theta}(y|x_t)$ is not difficult, but $\sum_y$ is the main bottleneck because it requires summing over all candidate categories, and the number of candidates in practice might be prohibitively large. Related discussions appeared in the previous post "A Discussion on Reparameterization: From Normal Distribution to Gumbel Softmax". Therefore, a better approach is to convert it into a sampling estimate form:
\begin{equation}\begin{aligned}
&\mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\left[\sum_y p_{\theta}(y|x_t)\frac{\nabla_{\theta} p_{\theta}(y|x_t)}{p_{\theta}(y|x_t)} r\left(y_t, y\right)\right]\\
=& \mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\left[\sum_y p_{\theta}(y|x_t)r\left(y_t, y\right)\nabla_{\theta} \log p_{\theta}(y|x_t)\right]\\
=& \mathbb{E}_{(x_t,y_t)\sim\mathcal{D}, y\sim p_{\theta}(y|x_t)}\left[r\left(y_t, y\right)\nabla_{\theta}\log p_{\theta}(y|x_t)\right]
\end{aligned}\end{equation}
In principle, we only need to sample an appropriate number of $y$'s to estimate the above expression. The result is what is known as the "Policy Gradient." With the gradient in hand, we can use standard optimizers to complete the optimization.
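In an autodiff framework, this sampled estimate corresponds to the familiar score-function (REINFORCE) estimator: sample $y \sim p_{\theta}(y|x_t)$ and minimize $-r(y_t, y)\log p_{\theta}(y|x_t)$. A minimal PyTorch sketch, where the classifier producing `logits` and the choice of reward are illustrative assumptions:

```python
import torch

def policy_gradient_loss(logits, y_true, reward_fn, n_samples=4):
    """Sample y ~ p_theta(y|x) and weight -log p_theta(y|x) by the reward r(y_t, y).
    Backprop through this loss yields the sampled policy-gradient estimate."""
    log_probs = torch.log_softmax(logits, dim=-1)                  # log p_theta(y|x)
    dist = torch.distributions.Categorical(logits=logits)
    ys = dist.sample((n_samples,))                                 # (n_samples, batch)
    rs = torch.stack([reward_fn(y_true, y).float() for y in ys])   # r(y_t, y), no gradient needed
    batch_idx = torch.arange(logits.shape[0])
    sampled_log_probs = log_probs[batch_idx, ys]                   # (n_samples, batch)
    return -(rs.detach() * sampled_log_probs).mean()               # minimize => maximize expected reward

# example reward: 1 for a correct prediction, 0 otherwise (i.e. accuracy as r)
reward_fn = lambda y_t, y: (y_t == y).float()
```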
Reducing Variance
As mentioned earlier, adding a constant to $r(y_t, y)$ does not change the final theoretical result. However, it can change the efficiency of the sampling estimation. In statistical terms, it changes the variance of the samples. For a simple example, consider $[4, 5, 6]$ and $[-10, 10, 15]$. Both have a mean of 5 (representing the target we want to estimate), but their variances are 0.67 and 116.67, respectively. If we only take one sample, the maximum error for the former is 1, while for the latter, it could reach 15. Thus, although the theoretical means are the same, the former's estimation efficiency is much higher (fewer samples for higher precision).
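These numbers are easy to verify (a one-off check, using the population variance):

```python
import numpy as np
print(np.mean([4, 5, 6]), np.var([4, 5, 6]))          # 5.0 0.666...
print(np.mean([-10, 10, 15]), np.var([-10, 10, 15]))  # 5.0 116.666...
```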
This simple example tells us that to improve estimation efficiency, we must find a way to obtain an estimator with smaller variance. We can subtract a constant $b$ (called a baseline; "constant" here means it does not depend on $y$, though it can depend on $x$) from $r(y_t, y)$:
\begin{equation}\mathbb{E}_{y\sim p_{\theta}(y|x_t)}\left[(r\left(y_t, y\right)-b)\nabla_{\theta}\log p_{\theta}(y|x_t)\right]\label{eq:var-reduce}\end{equation}
The final result (the mean) does not change:
\begin{equation}\begin{aligned}
&\mathbb{E}_{y\sim p_{\theta}(y|x_t)}\left[(r\left(y_t, y\right)-b)\nabla_{\theta}\log p_{\theta}(y|x_t)\right]\\
=&\mathbb{E}_{y\sim p_{\theta}(y|x_t)}\left[r\left(y_t, y\right)\nabla_{\theta}\log p_{\theta}(y|x_t)\right]-b\mathbb{E}_{y\sim p_{\theta}(y|x_t)}\left[\nabla_{\theta}\log p_{\theta}(y|x_t)\right]\\
=&\mathbb{E}_{y\sim p_{\theta}(y|x_t)}\left[r\left(y_t, y\right)\nabla_{\theta}\log p_{\theta}(y|x_t)\right]-b\sum_y \nabla_{\theta} p_{\theta}(y|x_t)\\
=&\mathbb{E}_{y\sim p_{\theta}(y|x_t)}\left[r\left(y_t, y\right)\nabla_{\theta}\log p_{\theta}(y|x_t)\right]-b \nabla_{\theta} \sum_y p_{\theta}(y|x_t)\\
=&\mathbb{E}_{y\sim p_{\theta}(y|x_t)}\left[r\left(y_t, y\right)\nabla_{\theta}\log p_{\theta}(y|x_t)\right]-b \nabla_{\theta} 1\\
=&\mathbb{E}_{y\sim p_{\theta}(y|x_t)}\left[r\left(y_t, y\right)\nabla_{\theta}\log p_{\theta}(y|x_t)\right]\\
\end{aligned}\end{equation}
However, the variance may change, and we would like it to be as small as possible. Since $\text{Var}[x]=\mathbb{E}[x^2]-\mathbb{E}[x]^2$ and we have just shown that the mean does not depend on $b$, minimizing the variance is equivalent to minimizing the second moment:
\begin{equation}\mathbb{E}_{y\sim p_{\theta}(y|x_t)}\left[(r\left(y_t, y\right)-b)^2\Vert\nabla_{\theta}\log p_{\theta}(y|x_t)\Vert^2\right]\end{equation}
This is a simple quadratic minimization problem. The optimal $b$ is found to be:
\begin{equation}b = \frac{\mathbb{E}_{y\sim p_{\theta}(y|x_t)}\left[r\left(y_t, y\right)\Vert\nabla_{\theta}\log p_{\theta}(y|x_t)\Vert^2\right]}{\mathbb{E}_{y\sim p_{\theta}(y|x_t)}\left[\Vert\nabla_{\theta}\log p_{\theta}(y|x_t)\Vert^2\right]}\end{equation}
That is, $b$ is a weighted expectation of $r(y_t, y)$ with weights $\Vert\nabla_{\theta}\log p_{\theta}(y|x_t)\Vert^2$. However, computing the gradient norm for every sampled candidate is expensive, so these weights are usually dropped in favor of the simplified version:
\begin{equation}b = \mathbb{E}_{y\sim p_{\theta}(y|x_t)}\left[r\left(y_t, y\right)\right]\end{equation}
Combined with Eq. $\eqref{eq:var-reduce}$, the idea is quite intuitive: sample several $y$'s from $p_{\theta}(y|x_t)$, compute their average score $b$ under $r(y_t, y)$, then increase the probability of (reinforce) the $y$'s scoring above the average and decrease the probability of (weaken) those scoring below it.
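Extending the earlier sketch with this simplified baseline (again an illustrative sketch under the same assumed `logits` and `reward_fn`): sample several $y$'s per input, use their average reward as $b$, and weight each $\log p_{\theta}(y|x_t)$ by $r(y_t, y)-b$.

```python
import torch

def policy_gradient_loss_with_baseline(logits, y_true, reward_fn, n_samples=8):
    """Same estimator as before, but each reward is centered by the per-input
    average reward b over the sampled y's (the simplified baseline)."""
    log_probs = torch.log_softmax(logits, dim=-1)
    dist = torch.distributions.Categorical(logits=logits)
    ys = dist.sample((n_samples,))                                 # (n_samples, batch)
    rs = torch.stack([reward_fn(y_true, y).float() for y in ys])   # (n_samples, batch)
    advantages = (rs - rs.mean(dim=0, keepdim=True)).detach()      # r(y_t, y) - b
    batch_idx = torch.arange(logits.shape[0])
    sampled_log_probs = log_probs[batch_idx, ys]
    # samples above the average are reinforced, those below are weakened
    return -(advantages * sampled_log_probs).mean()
```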
In a Nutshell
In simple terms, Policy Gradient replaces a non-differentiable objective (like those with $\mathop{\text{argmax}}$) with the differentiable objective $\eqref{eq:policy}$. In reinforcement learning terms, $y$ is the "action" (the decision), $p_{\theta}(y|x_t)$ is the "policy", and $r(y_t, y)$ is the "reward". By combining sampling estimation and variance reduction techniques, one obtains an effective gradient estimate for the original model, thus completing the optimization.
Zeroth-Order Optimization
Zeroth-order optimization refers to all optimization methods that do not require gradient information. In general contexts, it specifically refers to optimization algorithms that estimate the parameter update direction based on parameter sampling and finite difference principles. Formally, it samples directly in the parameter space and does not depend on any form of gradient. Theoretically, it can solve an extremely wide variety of objectives. However, because it samples directly in the parameter space (acting like a more intelligent grid search), its efficiency is quite low in high-dimensional parameter spaces (such as deep learning scenarios), thus its application scope is limited.
Nevertheless, it is worth learning the ideas behind zeroth-order optimization. Having more skills never hurts. Furthermore, while deep learning models often have many parameters, we usually design them such that most modules are differentiable, with only a small part being non-differentiable. Therefore, it might be possible to optimize the differentiable parts using gradient-based optimizers and only the non-differentiable parts using zeroth-order optimization. This idea has appeared in many NAS (Neural Architecture Search) papers.
Zeroth-Order Gradient
Zeroth-order optimization does not require the gradient as we usually define it, but it defines a "substitute" based on sampling and differences, which we will call the "Zeroth-Order Gradient":
For a scalar function $f(x)$, we define its zeroth-order gradient at $x$ as:
\begin{equation}\tilde{\nabla}_{x}f(x)=\mathbb{E}_{u\sim p(u)}\left[\frac{f(x + \varepsilon u) - f(x)}{\varepsilon}u\right]\label{eq:zero-grad}\end{equation}
where $\varepsilon$ is a pre-defined small positive number, and $p(u)$ is a pre-specified distribution with mean 0 and identity covariance matrix (usually the standard normal distribution).
As seen, by sampling several points from $p(u)$, one can estimate the zeroth-order gradient. Once the zeroth-order gradient is obtained, it can be treated as a regular gradient for standard optimizers. This is the basic idea of zeroth-order optimization. Specifically, if $f(x)$ itself is differentiable, then $f(x + \varepsilon u)=f(x)+\varepsilon u^{\top}\nabla_x f(x) + \mathcal{O}(\varepsilon^2)$, so as $\varepsilon \to 0$:
\begin{equation}\tilde{\nabla}_{x}f(x)=\int p(u) u \left(u^{\top}\nabla_x f(x)\right)du=\int p(u) \left(u u^{\top}\right)\nabla_x f(x)du=\nabla_x f(x)\end{equation}
Thus, in the differentiable case, $\tilde{\nabla}_{x}f(x)$ coincides with the ordinary gradient, so it is indeed a reasonable generalization of it.
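A minimal NumPy sketch of the estimator in $\eqref{eq:zero-grad}$ (the test function below is only a sanity check; in the intended use case $f$ may be completely non-differentiable):

```python
import numpy as np

def zeroth_order_grad(f, x, eps=1e-2, n_samples=1000, rng=None):
    """Monte-Carlo estimate of E_u[(f(x + eps*u) - f(x)) / eps * u], with u ~ N(0, I)."""
    if rng is None:
        rng = np.random.default_rng(0)
    fx = f(x)
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)        # mean 0, identity covariance
        grad += (f(x + eps * u) - fx) / eps * u
    return grad / n_samples

# sanity check on a differentiable f(x) = ||x||^2, whose true gradient is 2x
x = np.array([1.0, -2.0, 3.0])
print(zeroth_order_grad(lambda v: np.sum(v**2), x, n_samples=20000))  # roughly [2, -4, 6]
```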
Also Has a Baseline
Attentive readers might notice that in definition $\eqref{eq:zero-grad}$, since the mean of $p(u)$ is 0, the term $-f(x)$ does not actually affect the final theoretical result, i.e.:
\begin{equation}\tilde{\nabla}_{x}f(x)=\mathbb{E}_{u\sim p(u)}\left[\frac{f(x + \varepsilon u)}{\varepsilon}u\right] - \mathbb{E}_{u\sim p(u)}\left[\frac{f(x)}{\varepsilon}u\right]=\mathbb{E}_{u\sim p(u)}\left[\frac{f(x + \varepsilon u)}{\varepsilon}u\right]\label{eq:zero-grad-equal}\end{equation}
So what is the point of $-f(x)$? Just like Policy Gradient earlier, its purpose is to reduce variance. We can similarly introduce $b$ and minimize the second moment (equivalent to minimizing variance):
\begin{equation}\mathbb{E}_{u\sim p(u)}\left[\left(\frac{f(x + \varepsilon u)-b}{\varepsilon}\right)^2\Vert u\Vert^2\right]\end{equation}
Solving for the optimal $b$ yields:
\begin{equation}b=\frac{\mathbb{E}_{u\sim p(u)}\left[f(x + \varepsilon u)\Vert u\Vert^2\right]}{\mathbb{E}_{u\sim p(u)}\left[\Vert u\Vert^2\right]}\end{equation}
In practice, one can use a finite number of samples to estimate this expression. In fact, if $f(x)$ is differentiable, a first-order Taylor expansion shows that the optimal $b$ equals $f(x) + \mathcal{O}(\varepsilon^2)$, since for a symmetric $p(u)$ (e.g. the standard normal) the first-order term vanishes. From this perspective, $b=f(x)$ is a reasonable choice, which justifies including the $-f(x)$ term in the definition.
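A quick numerical check of this optimal baseline (same illustrative setup as the zeroth-order sketch above): estimated from samples, it indeed stays close to $f(x)$.

```python
import numpy as np

def optimal_zo_baseline(f, x, eps=1e-2, n_samples=2000, rng=None):
    """Sample estimate of b = E[f(x + eps*u) ||u||^2] / E[||u||^2]."""
    if rng is None:
        rng = np.random.default_rng(0)
    us = rng.standard_normal((n_samples,) + x.shape)
    weights = np.sum(us ** 2, axis=tuple(range(1, us.ndim)))   # ||u||^2 per sample
    values = np.array([f(x + eps * u) for u in us])
    return np.sum(values * weights) / np.sum(weights)

x = np.array([1.0, -2.0, 3.0])
f = lambda v: np.sum(v ** 2)
print(optimal_zo_baseline(f, x), f(x))   # both approximately 14.0
```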
In a Nutshell
Zeroth-order optimization primarily uses differences to define a reasonable generalization of the gradient. Since calculating differences does not require function differentiability, it naturally applies to both differentiable and non-differentiable objectives. Of course, as the dimension of $u$ matches the dimension of all parameters $\theta$, for deep learning models with massive parameters, the variance during updates is very large, making convergence difficult. Therefore, zeroth-order optimization is usually used only for a small portion of the model parameters or as an auxiliary means (e.g., alternating between "differentiable objective + normal gradient + large learning rate" and "non-differentiable objective + zeroth-order gradient + small learning rate"). There is also research on applying zeroth-order optimization directly in high-dimensional spaces (e.g., "Gradientless Descent: High-Dimensional Zeroth-Order Optimization"), but they have not yet been widely adopted.
Furthermore, the unified perspective introduced in "Optimization from Sampling: A Unified View of Differentiable and Non-Differentiable Optimization" can also be seen as a form of zeroth-order optimization. It is a more general extension of common optimization ideas (from which one can derive gradient descent, Newton's method, zeroth-order gradients, etc.), but in principle, it suffers from the same "common ailments" like high variance that affect zeroth-order gradients.
Unity in Diversity
On the surface, Policy Gradient and Zeroth-Order Optimization do share many similarities, such as both needing random sampling to estimate gradients and both needing variance reduction. Of course, differences also abound: Policy Gradient samples from the policy space, while Zeroth-Order Optimization samples from the parameter space. Furthermore, Policy Gradient is essentially still based on gradients, whereas Zeroth-Order Optimization theoretically requires no gradients at all.
So, what exactly is the relationship between the two? Next, we will demonstrate that for the optimization problem $\eqref{eq:base}$ proposed at the start of this article, the two are essentially equivalent. The proof strategy involves calculating the zeroth-order gradient for objective $\eqref{eq:base}$ and finding that, after a series of simplifications, it is fundamentally the Policy Gradient.
Partitioning the Full Space
Let us denote:
\begin{equation}\mathcal{R}_{\theta}=\mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\left[r\left(y_t, \mathop{\text{argmax}}_y p_{\theta}(y|x_t)\right)\right]\end{equation}
Then, according to equation $\eqref{eq:zero-grad-equal}$, its zeroth-order gradient is:
\begin{equation}\begin{aligned}
\tilde{\nabla}_{\theta}\mathcal{R}_{\theta}=&\frac{1}{\varepsilon}\int \mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\left[r\left(y_t, \mathop{\text{argmax}}_y p_{\theta + \varepsilon u}(y|x_t)\right)\right] p(u) u du\\
=&\mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\left[\frac{1}{\varepsilon}\int r\left(y_t, \mathop{\text{argmax}}_y p_{\theta + \varepsilon u}(y|x_t)\right)p(u) u du\right]
\end{aligned}\end{equation}
This can be transformed into:
\begin{equation}\begin{aligned}
\tilde{\nabla}_{\theta}\mathcal{R}_{\theta}=& \mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\left[\frac{1}{\varepsilon}\sum_y\int_{\Omega_{y|x_t}} r\left(y_t, y\right)p(u) u du\right]\\
\Omega_{y|x_t} =& \left\{u\left|u\in\mathbb{R}^d, y=\mathop{\text{argmax}}_{\hat{y}} p_{\theta + \varepsilon u}(\hat{y}|x_t)\right.\right\}
\end{aligned}\end{equation}
This looks complex, but the idea is simple: different values of $u$ lead to different prediction outcomes $\mathop{\text{argmax}}_{y} p_{\theta + \varepsilon u}(y|x_t)$. We group all the $u$'s that produce the same prediction $y$ into a set $\Omega_{y|x_t}$; in this way the whole space $\mathbb{R}^d$ is partitioned into the disjoint subsets $\Omega_{y|x_t}$. An integral with no region specified is taken over the whole space $\mathbb{R}^d$, and after the partition it becomes the sum of integrals over the subsets $\Omega_{y|x_t}$.
Indicator Function
Next, we use an "indicator function" trick, defining:
\begin{equation}\chi(y|x_t, u)=\left\{\begin{aligned}1, & u\in\Omega_{y|x_t}\\
0, & u\not\in\Omega_{y|x_t}\end{aligned}\right.\end{equation}
Then:
\begin{equation}\begin{aligned}
\frac{1}{\varepsilon}\sum_y\int_{\Omega_{y|x_t}} r\left(y_t, y\right)p(u) u du=\frac{1}{\varepsilon}\sum_y\int \chi(y|x_t, u) r\left(y_t, y\right)p(u) u du
\end{aligned}\end{equation}
Recalling the meaning of $\Omega_{y|x_t}$: for any $u \in \Omega_{y|x_t}$, the prediction of the model $p_{\theta + \varepsilon u}(\cdot|x_t)$ is exactly $y$, so the indicator equals 1. In essence, then:
\begin{equation}\chi(y|x_t, u)=\lim_{\tau\to 0} p_{\theta + \varepsilon u}(y|x_t)\end{equation}
where $\tau$ is the temperature parameter of the softmax (introduced at the beginning of the article). Based on this observation, we can approximately replace $\chi(y|x_t, u)$ with $p_{\theta + \varepsilon u}(y|x_t)$ and substitute it into the expression:
\begin{equation}
\tilde{\nabla}_{\theta}\mathcal{R}_{\theta} \approx \mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\left[\frac{1}{\varepsilon}\sum_y\int p_{\theta + \varepsilon u}(y|x_t) r\left(y_t, y\right)p(u) u du\right]
\end{equation}
Approximate Integration
Finally, since $p_{\theta + \varepsilon u}(y|x_t)$ is differentiable, we use the expansion $p_{\theta + \varepsilon u}(y|x_t) \approx p_{\theta}(y|x_t) + \varepsilon u^{\top}\nabla_{\theta}p_{\theta}(y|x_t)$ to perform the integration over $u$:
\begin{equation}
\tilde{\nabla}_{\theta}\mathcal{R}_{\theta} \approx \mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\left[\sum_y r\left(y_t, y\right)\nabla_{\theta}p_{\theta}(y|x_t)\right]
\end{equation}
Comparing this to equation $\eqref{eq:policy-grad-base}$, we find that the right side of the above expression is exactly the Policy Gradient.
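For completeness, the integration over $u$ in the last step uses only two moments of $p(u)$, namely $\int p(u)\, u\, du = 0$ (zero mean) and $\int p(u)\, u u^{\top} du = I$ (identity covariance):
\begin{equation}\frac{1}{\varepsilon}\int \left(p_{\theta}(y|x_t) + \varepsilon u^{\top}\nabla_{\theta}p_{\theta}(y|x_t)\right) p(u)\, u\, du = \frac{p_{\theta}(y|x_t)}{\varepsilon}\int p(u)\, u\, du + \left(\int p(u)\, u u^{\top} du\right)\nabla_{\theta}p_{\theta}(y|x_t) = \nabla_{\theta}p_{\theta}(y|x_t)\end{equation}
Summing over $y$ with the weights $r(y_t, y)$ then yields the expression above.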
Thus, the zeroth-order gradient, despite its starkly different appearance, points in a direction fundamentally similar to the Policy Gradient. It is truly a case of "different roads leading to the same destination," reinforcing the feeling that "there is only one correct answer." This reminds me of Leo Tolstoy's famous quote:
Happy families are all alike; every unhappy family is unhappy in its own way.
The same applies to the update direction of parameters (correct update directions are all similar)!
Summary
This article introduced two schemes for handling non-differentiable optimization objectives: Policy Gradient and Zeroth-Order Optimization. They define new "gradients" to serve as parameter update directions from two different perspectives. Although they seem to take entirely different paths, this article showed that, for the class of optimization problems to which Policy Gradient is usually applied, the update direction given by Zeroth-Order Optimization is essentially equivalent to the Policy Gradient. This article can also serve as an introductory reference on reinforcement learning for beginners.