Exponentiated Gradient Descent + Meta-Learning = Adaptive Learning Rate

By 苏剑林 | March 03, 2022

A couple of days ago, I came across a Google paper titled "Step-size Adaptation Using Exponentiated Gradient Updates", from which I learned a few new concepts, so I am documenting and sharing them here. It covers two main topics: Exponentiated Gradient Descent for non-negative optimization, and a learning-rate adjustment algorithm based on the idea of meta-learning. Both are quite interesting, and interested readers may want to take a look at the paper as well.

Exponentiated Gradient Descent

You have likely heard plenty about gradient descent, which minimizes an unconstrained function $\mathcal{L}(\boldsymbol{\theta})$ via the update rule:

\begin{equation}\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_t)\end{equation}

where $\eta$ is the learning rate. However, not every task is unconstrained. For the simplest case, a non-negativity constraint, we can instead use the following update rule:

\begin{equation}\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t \odot \exp\left(- \eta\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_t)\right)\label{eq:egd}\end{equation}

Here $\odot$ denotes element-wise multiplication (Hadamard product). It is easy to see that as long as the initialization $\boldsymbol{\theta}_0$ is non-negative, $\boldsymbol{\theta}_t$ will remain non-negative throughout the entire update process. This is the "Exponentiated Gradient Descent" used for non-negative constrained optimization.
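
As a concrete illustration, here is a minimal NumPy sketch of exponentiated gradient descent on a toy non-negative least-squares problem; the matrix $A$, target $b$, step size, and iteration count are arbitrary choices for demonstration, not taken from the paper.

```python
import numpy as np

# Toy non-negative least squares: minimize ||A @ theta - b||^2 subject to theta >= 0.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

theta = np.ones(5)  # positive initialization
eta = 1e-2

for _ in range(1000):
    grad = 2 * A.T @ (A @ theta - b)     # gradient of the squared loss
    theta = theta * np.exp(-eta * grad)  # exponentiated gradient step, Eq. (2)

print(theta)  # every component has stayed positive throughout
```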

How do we make sense of this "Exponentiated Gradient Descent"? It is not difficult; it can be derived by reducing the problem to the unconstrained case. If $\boldsymbol{\theta}$ is strictly positive, then $\boldsymbol{\varphi}=\log\boldsymbol{\theta}$ is unconstrained and can take any real value. We can therefore set $\boldsymbol{\theta}=e^{\boldsymbol{\varphi}}$ to turn the problem into an unconstrained optimization over $\boldsymbol{\varphi}$, which can then be solved by gradient descent:

\begin{equation}\boldsymbol{\varphi}_{t+1} = \boldsymbol{\varphi}_t - \eta\nabla_{\boldsymbol{\varphi}}\mathcal{L}(e^{\boldsymbol{\varphi}_t}) = \boldsymbol{\varphi}_t - \eta e^{\boldsymbol{\varphi}_t}\odot\nabla_{e^{\boldsymbol{\varphi}}}\mathcal{L}(e^{\boldsymbol{\varphi}_t})\end{equation}

The factor $e^{\boldsymbol{\varphi}_t}\odot$ in front of the gradient merely rescales the per-component learning rate and is not essential, so we omit it to obtain:

\begin{equation}\boldsymbol{\varphi}_{t+1} = \boldsymbol{\varphi}_t - \eta \nabla_{e^{\boldsymbol{\varphi}}}\mathcal{L}(e^{\boldsymbol{\varphi}_t})\end{equation}

Taking the exponential of both sides gives:

\begin{equation}e^{\boldsymbol{\varphi}_{t+1}} = e^{\boldsymbol{\varphi}_t}\odot\exp\left( - \eta \nabla_{e^{\boldsymbol{\varphi}}}\mathcal{L}(e^{\boldsymbol{\varphi}_t})\right)\end{equation}

Substituting back $\boldsymbol{\theta}=e^{\boldsymbol{\varphi}}$ yields Equation $\eqref{eq:egd}$.
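
One can verify this equivalence numerically: a single gradient step on $\boldsymbol{\varphi}$ (with the $e^{\boldsymbol{\varphi}_t}\odot$ factor dropped), followed by exponentiation, coincides exactly with the multiplicative update of Equation $\eqref{eq:egd}$. A minimal sketch, with an arbitrary quadratic loss chosen purely for illustration:

```python
import numpy as np

def loss_grad(theta):
    # Gradient of the toy loss L(theta) = sum((theta - 2)**2), chosen arbitrarily.
    return 2 * (theta - 2.0)

eta = 0.1
theta = np.array([0.5, 1.5, 3.0])

# One step on phi = log(theta), with the e^phi factor dropped ...
phi_next = np.log(theta) - eta * loss_grad(theta)
# ... versus one direct exponentiated-gradient step on theta:
theta_next = theta * np.exp(-eta * loss_grad(theta))

assert np.allclose(np.exp(phi_next), theta_next)  # the two updates coincide
```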

Meta-Learning for Learning Rate Adjustment

Regarding meta-learning, many readers may be like me: they have heard the term often but have had almost no real contact with it. Simply put, ordinary machine learning is to meta-learning as a "function" is to a "functional" in mathematics. A functional is a "function of functions," and meta-learning is "learning how to learn": a methodology about the process of learning itself, such as what is introduced next, "using gradient descent to adjust gradient descent."

We start from ordinary gradient descent. Writing $\boldsymbol{g}_t = \nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_t)$ for the gradient of the objective $\mathcal{L}$, the update formula is:

\begin{equation}\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\boldsymbol{g}_t\end{equation}

We want to adjust the learning rate for each component, so we introduce a non-negative variable $\boldsymbol{\nu}$ of the same size as the parameters and modify the update formula to:

\begin{equation}\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\boldsymbol{\nu}_{t+1}\odot\boldsymbol{g}_t\label{eq:update}\end{equation}

So, by what rule should $\boldsymbol{\nu}$ be iterated? Remember that our ultimate goal is still to minimize $\mathcal{L}$, so $\boldsymbol{\nu}$ should also be updated by gradient descent; and since $\boldsymbol{\nu}$ must stay non-negative, we use exponentiated gradient descent:

\begin{equation}\boldsymbol{\nu}_{t+1} = \boldsymbol{\nu}_t \odot\exp\left(- \gamma\nabla_{\boldsymbol{\nu}_t}\mathcal{L}\right)\label{eq:update-nu}\end{equation}

Note that $\mathcal{L}$ is, strictly speaking, a function of $\boldsymbol{\theta}$ alone; but according to $\eqref{eq:update}$, at step $t$ we have $\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta\boldsymbol{\nu}_t\odot\boldsymbol{g}_{t-1}$, so $\mathcal{L}$ depends on $\boldsymbol{\nu}_t$ through $\boldsymbol{\theta}_t$. By the chain rule:

\begin{equation}\nabla_{\boldsymbol{\nu}_t}\mathcal{L} = -\eta\boldsymbol{g}_{t-1} \odot\nabla_{\boldsymbol{\theta}_t}\mathcal{L}= -\eta\boldsymbol{g}_{t-1} \odot\boldsymbol{g}_t\end{equation}
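
Spelled out component-wise (treating $\boldsymbol{\theta}_{t-1}$ and $\boldsymbol{g}_{t-1}$ as constants with respect to $\boldsymbol{\nu}_t$), this is:

\begin{equation}\frac{\partial \mathcal{L}}{\partial \nu_{t,i}} = \frac{\partial \theta_{t,i}}{\partial \nu_{t,i}}\cdot\frac{\partial \mathcal{L}}{\partial \theta_{t,i}} = \left(-\eta\, g_{t-1,i}\right)\cdot g_{t,i}\end{equation}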

Substituting this into the update formula for $\boldsymbol{\nu}$ in Equation $\eqref{eq:update-nu}$, we get:

\begin{equation}\boldsymbol{\nu}_{t+1} = \boldsymbol{\nu}_t \odot\exp\left( \gamma\eta\boldsymbol{g}_{t-1} \odot\boldsymbol{g}_t\right)\end{equation}

Absorbing $\eta$ into $\gamma$ (i.e., writing $\gamma$ for $\gamma\eta$), the update formulas for the whole scheme are:

\begin{equation}\begin{aligned}&\boldsymbol{\nu}_{t+1} = \boldsymbol{\nu}_t \odot\exp\left( \gamma\boldsymbol{g}_{t-1} \odot\boldsymbol{g}_t\right) \\ &\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\boldsymbol{\nu}_{t+1}\odot\boldsymbol{g}_t\end{aligned}\end{equation}
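
Putting the two rules together, here is a minimal NumPy sketch of the resulting optimizer on a toy quadratic objective; the objective and the values of $\eta$ and $\gamma$ are arbitrary choices for illustration, not taken from the paper.

```python
import numpy as np

def grad(theta):
    # Gradient of a toy quadratic loss L(theta) = 0.5 * sum(c * theta**2);
    # the curvature coefficients c are an arbitrary choice for illustration.
    c = np.array([1.0, 2.0, 4.0])
    return c * theta

eta, gamma = 1e-2, 1e-3   # assumed hyperparameters, not from the paper
theta = np.ones(3)
nu = np.ones(3)           # per-component learning-rate multipliers
g_prev = np.zeros(3)      # zero so that the first step leaves nu unchanged

for t in range(2000):
    g = grad(theta)
    nu = nu * np.exp(gamma * g_prev * g)  # exponentiated-gradient update for nu
    theta = theta - eta * nu * g          # scaled gradient step for theta
    g_prev = g

print(theta)  # close to the minimizer at zero
print(nu)     # grew where successive gradients kept the same sign
```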

If $\boldsymbol{\nu}$ is initialized to all ones, then unrolling the recursion for $\boldsymbol{\nu}$ gives:

\begin{equation}\boldsymbol{\nu}_{t+1} = \exp\left(\gamma\sum_{k=1}^t\boldsymbol{g}_{k-1} \odot\boldsymbol{g}_k\right)\end{equation}

As can be seen, the idea behind this learning-rate adjustment scheme is: if, for some component, the gradients of adjacent steps mostly share the same sign, the accumulated sum is positive, so the effective learning rate can be appropriately increased; if the gradients of adjacent steps mostly alternate in sign, the accumulated sum tends to be negative, so the learning rate should be appropriately decreased.

Note that this differs from the idea behind Adam's learning-rate adaptation. Adam reasons that if a component's gradient stays very small for a long time, the corresponding parameter may not have been learned well, so it scales up that component's effective learning rate. Both heuristics have their own justification.

A Brief Summary

This article has offered brief notes on "Exponentiated Gradient Descent" and "Meta-Learning for Learning Rate Adjustment." The former is a simple and effective scheme for optimization under non-negativity constraints, while the latter is a straightforward, easy-to-understand application of meta-learning. In presenting the latter, I have simplified the formulation relative to the original paper, but the underlying idea remains the same.

