By 苏剑林 | Nov 25, 2019
Optimizers might be one of the most "metaphysical" modules in deep learning: sometimes switching an optimizer brings a significant improvement, while other times an optimizer that others claim is great does absolutely nothing for your own task. Optimizers with good theoretical properties don't necessarily work well, and those born purely from a stroke of inspiration aren't necessarily bad. Regardless, optimizers provide another choice for those who love "deep learning alchemy."
In recent years, work regarding optimizers seems to be slowly increasing. Many papers have proposed various improvements to commonly used optimizers (especially Adam). This article summarizes several optimizer works or techniques and provides a unified code implementation for readers to use as needed.
Basic Form
These variants are "derived" in the sense that they are built on top of existing optimizers: any existing optimizer can be combined with these techniques to produce a new one.
The basic form of an existing optimizer is:
\begin{equation}
\begin{aligned}
\boldsymbol{g}_t =&\, \nabla_{\boldsymbol{\theta}} L\\
\boldsymbol{h}_t =&\, f(\boldsymbol{g}_{\leq t})\\
\boldsymbol{\theta}_{t+1} =&\, \boldsymbol{\theta}_t - \gamma \boldsymbol{h}_t
\end{aligned}
\end{equation}
Where $\boldsymbol{g}_t$ is the gradient, and $\boldsymbol{g}_{\leq t}$ refers to all gradient information up to the current step. Some operation $f$ (such as accumulating momentum, or rescaling the learning rate with an accumulated second-order moment) turns it into $\boldsymbol{h}_t$, which is then used to update the parameters. Here $\gamma$ denotes the learning rate.
Assorted Variants
Below are introductions to 6 optimizer variants, which can also be understood as techniques for using optimizers. These techniques are sometimes very effective, while other times they may be ineffective or even counterproductive; there is no general rule, so it is best to view them as "more choices provide more possibilities."
Weight Decay
Weight decay refers to directly adding a decay term after each update step of the optimizer:
\begin{equation}
\begin{aligned}
\boldsymbol{g}_t =&\, \nabla_{\boldsymbol{\theta}} L\\
\boldsymbol{h}_t =&\, f(\boldsymbol{g}_{\leq t})\\
\boldsymbol{\theta}_{t+1} =&\, \boldsymbol{\theta}_t - \gamma \boldsymbol{h}_t - \gamma \lambda \boldsymbol{\theta}_t
\end{aligned}
\end{equation}
Where $\lambda$ is called the "decay rate." In SGD, weight decay is equivalent to adding an $l_2$ regularization term $\frac{1}{2}\lambda \Vert \boldsymbol{\theta}\Vert_2^2$ to the loss. However, in optimizers with adaptive learning rates like Adagrad and Adam, $f$ becomes non-linear, so the two are no longer equivalent. The paper "Decoupled Weight Decay Regularization" specifically points out that the anti-overfitting ability of weight decay is superior to its corresponding $l_2$ regularization, recommending the use of weight decay instead of $l_2$ regularization.
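As a quick illustration of the difference, here is a minimal NumPy sketch (the function names and the base update $f$ are purely illustrative): $l_2$ regularization feeds the decay through $f$, while decoupled weight decay bypasses $f$; with plain SGD ($f$ the identity) the two coincide.

import numpy as np

def l2_step(theta, grad, f, gamma, lam):
    # l2 regularization: the decay enters through the gradient,
    # so it also passes through f (momentum, adaptive scaling, ...)
    return theta - gamma * f(grad + lam * theta)

def weight_decay_step(theta, grad, f, gamma, lam):
    # decoupled weight decay: the decay bypasses f and acts on theta directly
    return theta - gamma * f(grad) - gamma * lam * theta

f = lambda g: g  # plain SGD as the base update, purely for illustration
theta, grad = np.array([1.0, -2.0]), np.array([0.3, 0.1])
print(l2_step(theta, grad, f, gamma=0.1, lam=0.01))            # identical for SGD...
print(weight_decay_step(theta, grad, f, gamma=0.1, lam=0.01))  # ...but not for an adaptive f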
Layer Adaptation
In an optimizer, the final update amount is determined by $\boldsymbol{h}_t$ and the learning rate $\gamma$. Sometimes the magnitude of $\boldsymbol{h}_t$ can exceed the magnitude of the parameters $\boldsymbol{\theta}_t$ themselves, which may lead to unstable updates. A natural idea, then, is that the update magnitude of each layer's parameters should be kept in proportion to that layer's parameter norm $\Vert\boldsymbol{\theta}_t\Vert_2$. This idea leads to the following optimizer variant:
\begin{equation}
\begin{aligned}
\boldsymbol{g}_t =&\, \nabla_{\boldsymbol{\theta}} L\\
\boldsymbol{h}_t =&\, f(\boldsymbol{g}_{\leq t})\\
\boldsymbol{\theta}_{t+1} =&\, \boldsymbol{\theta}_t - \gamma \boldsymbol{h}_t \times \frac{\Vert\boldsymbol{\theta}_t\Vert_2}{\Vert\boldsymbol{h}_t\Vert_2}
\end{aligned}
\end{equation}
If the base optimizer is Adam, the above optimizer is LAMB. The paper "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes" points out that LAMB is more effective than Adam when the batch size is large (thousands or more).
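To make the scaling concrete, here is a minimal NumPy sketch of a single layer-adapted step (the small eps added to the denominator is my own safeguard, not part of the formula above):

import numpy as np

def layer_adapted_step(theta, h, gamma, eps=1e-8):
    # scale the update so its magnitude tracks the parameter norm of this layer
    ratio = np.linalg.norm(theta) / (np.linalg.norm(h) + eps)
    return theta - gamma * ratio * h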
Piecewise Linear Learning Rate
The learning rate is also a mysterious existence in optimizers. Generally speaking, fine-tuning the learning rate strategy can yield improvements, while an inappropriate learning rate might even prevent the model from converging. Common learning rate strategies include warmup, exponential decay, step-wise decay (e.g., dropping to 1/10 after a certain epoch), as well as more exotic strategies like cosine or polynomial decay.
Considering that common functions can be approximated with piecewise linear functions, I have introduced a piecewise linear learning rate strategy for everyone to experiment with. The form is as follows:
\begin{equation}
\begin{aligned}
\boldsymbol{g}_t =&\, \nabla_{\boldsymbol{\theta}} L\\
\boldsymbol{h}_t =&\, f(\boldsymbol{g}_{\leq t})\\
\boldsymbol{\theta}_{t+1} =&\, \boldsymbol{\theta}_t - \gamma \rho_t\boldsymbol{h}_t
\end{aligned}
\end{equation}
Where $\rho_t$ is some piecewise linear function with step count $t$ as its independent variable.
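As a rough sketch of what such a $\rho_t$ might look like, the function below linearly interpolates a dict mapping steps to multipliers, assuming the schedule starts from $(0, 0)$ as in a warmup; the dict format mirrors the lr_schedule argument used in the examples later, though the exact semantics there may differ.

def piecewise_linear(t, schedule):
    # schedule: dict mapping step -> multiplier, e.g. {1000: 1.} for a linear warmup
    points = sorted(schedule.items())
    x_prev, y_prev = 0, 0.0  # assumed starting point (0, 0)
    for x, y in points:
        if t < x:
            # linear interpolation between the two surrounding points
            return y_prev + (y - y_prev) * (t - x_prev) / float(x - x_prev)
        x_prev, y_prev = x, y
    return points[-1][1]     # hold the last value afterwards

print(piecewise_linear(500, {1000: 1.}))   # 0.5, halfway through the warmup
print(piecewise_linear(2000, {1000: 1.}))  # 1.0, warmup finished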
Gradient Accumulation
Gradient accumulation was introduced earlier in "Trading Time for Results: Keras Gradient Accumulation Optimizer." It isn't strictly an optimizer variant, but it can be written into the optimizer to achieve the effect of a large batch size using small batch sizes, trading time for space. A larger batch size can sometimes improve results, especially when the baseline batch size is very small (say, below 8).
To restate the description of gradient accumulation from that article:
Gradient accumulation is quite simple. The gradient used for gradient descent is actually the average of gradients calculated from multiple samples. Taking batch_size=128 as an example, you could calculate the gradient for 128 samples at once and average them, or you could calculate the average gradient for 16 samples each time, cache and accumulate them, and after doing this 8 times, divide the total gradient by 8 and execute the parameter update. Of course, you must accumulate 8 times and use the average gradient of those 8 times to update parameters; you cannot update every time you calculate 16 samples, or else your batch_size would just be 16.
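Here is a toy, self-contained sketch of that description on a one-dimensional least-squares problem: 8 micro-batches of 16 samples reproduce one update with an effective batch size of 128 (the problem and all names are illustrative).

import numpy as np

np.random.seed(0)
x, y = np.random.randn(128), np.random.randn(128)
w, lr, accum_steps = 0.0, 0.1, 8
accum_grad, count = 0.0, 0

for xb, yb in zip(np.split(x, 8), np.split(y, 8)):
    grad = np.mean(2 * (w * xb - yb) * xb)  # gradient on a micro-batch of 16 samples
    accum_grad += grad
    count += 1
    if count == accum_steps:
        w -= lr * accum_grad / accum_steps  # one update with the averaged gradient
        accum_grad, count = 0.0, 0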
Lookahead
The Lookahead optimizer comes from the paper "Lookahead Optimizer: k steps forward, 1 step back," and was also introduced in a previous article "Implementing Two Optimizers in Keras: Lookahead and LazyOptimizer." The logic of Lookahead is to use a common optimizer to explore several steps forward and then update based on the exploration results. The process is as follows:
\begin{equation}
\begin{aligned}
&\boldsymbol{g}_t = \nabla_{\boldsymbol{\theta}} L\\
&\boldsymbol{h}_t = f(\boldsymbol{g}_{\leq t})\\
&\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \gamma\boldsymbol{h}_t\\
&\text{If } t \equiv 0 \pmod{k}:\\
&\qquad\boldsymbol{\Theta}_{t+1} = \boldsymbol{\Theta}_t + \alpha (\boldsymbol{\theta}_{t+1}- \boldsymbol{\Theta}_t)\\
&\qquad\boldsymbol{\theta}_{t+1} = \boldsymbol{\Theta}_{t+1} \,(\text{i.e., overwrite the original } \boldsymbol{\theta}_{t+1})
\end{aligned}
\end{equation}
Actually, calling this optimizer "Lookback" would also be fine, as it looks back every few steps and performs an interpolation with the weights from a few steps ago.
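The following toy sketch wraps Lookahead around plain SGD on the quadratic $\frac{1}{2}\Vert\boldsymbol{\theta}\Vert_2^2$, just to make the fast/slow-weight bookkeeping explicit (everything here is illustrative):

import numpy as np

np.random.seed(0)
theta = np.random.randn(3)  # fast weights, updated every step
Theta = theta.copy()        # slow weights, updated every k steps
gamma, alpha, k = 0.1, 0.5, 5

for t in range(1, 51):
    grad = theta                  # gradient of 0.5 * ||theta||^2
    theta = theta - gamma * grad  # inner (fast) update
    if t % k == 0:
        Theta = Theta + alpha * (theta - Theta)  # interpolate the slow weights forward
        theta = Theta.copy()                     # overwrite the fast weights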
Lazy Optimizer
The Lazy Optimizer was also introduced in the article "Implementing Two Optimizers in Keras: Lookahead and LazyOptimizer." Its motivation is that updates to the Embedding layer should be sparser: only the embeddings of tokens that actually appear in the current batch get updated, which helps prevent overfitting (reference: Zhihu discussion).
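A minimal sketch of the "lazy" idea, assuming the base optimizer has already produced an update h of the same shape as the embedding matrix (the function and variable names here are mine, not the bert4keras API):

import numpy as np

def lazy_update(embeddings, grad, h, gamma):
    # only touch rows whose gradient is non-zero, i.e. tokens seen in the current batch
    active = np.any(grad != 0, axis=1)
    embeddings[active] -= gamma * h[active]
    return embeddings

emb = np.random.randn(5, 4)              # toy vocabulary of 5 tokens
g = np.zeros_like(emb); g[[1, 3]] = 1.0  # only tokens 1 and 3 appear in the batch
lazy_update(emb, g, h=g, gamma=0.1)      # rows 0, 2 and 4 are left untouched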
Reference Implementation
The preceding introductions were quite brief; in fact, these variants are not difficult to understand. The key lies in the code implementation. From the descriptions above, one can see that these 6 variants are not mutually exclusive. Therefore, a good implementation should allow us to use them like building blocks, combining one or more variants together. Additionally, Keras currently has two branches: pure Keras and tf.keras. A good implementation should be compatible with both (or provide implementations for both simultaneously).
The Ultimate Grafting Trick
Although some variants have been implemented before, I have re-implemented them here using a new method. This implementation method stems from a "grafting" trick I accidentally discovered.
Suppose we have a class like this:
import numpy as np

class A(object):
    def __init__(self):
        self.a = np.ones(1)
        self.b = np.ones(2)
        self.c = np.ones(3)
Then suppose we want to inherit class A to get class B, and class B needs to replace all np.ones in the __init__ method with np.zeros, while everything else remains the same. Since __init__ might be a very complex process, it would be too redundant to copy it in its entirety and rewrite it.
Is there a way to replace everything with just a few lines of code? There actually is!
class B(A):
    def __init__(self):
        _ = np.ones            # stash the original np.ones
        np.ones = np.zeros     # temporarily replace it
        super(B, self).__init__()
        np.ones = _            # restore np.ones
With this demo, we can "mod" existing optimizers. In Keras, parameter updates are all implemented through K.update (refer to Keras optimizers.py). We only need to redefine K.update using the method above.
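For instance, here is a rough sketch of how the trick could add weight decay on top of Keras's Adam by intercepting K.update; it assumes an older standalone Keras where Adam exposes its learning rate as self.lr, and it is only an illustration of the idea, not the bert4keras implementation.

from keras import backend as K
from keras.optimizers import Adam

class AdamWithDecay(Adam):
    def __init__(self, weight_decay_rate=0.01, **kwargs):
        super(AdamWithDecay, self).__init__(**kwargs)
        self.weight_decay_rate = weight_decay_rate

    def get_updates(self, loss, params):
        old_update = K.update
        def new_update(x, new_x):
            # intercept parameter updates and subtract the decay term;
            # the moment variables (m, v) pass through untouched
            if any(x is p for p in params):
                new_x = new_x - self.lr * self.weight_decay_rate * x
            return old_update(x, new_x)
        K.update = new_update  # graft
        updates = super(AdamWithDecay, self).get_updates(loss, params)
        K.update = old_update  # restore
        return updates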
What about tf.keras? Unfortunately, this trick doesn't work there, because the update steps of the common optimizers in tf.keras are implemented in C++ (refer to tf.keras's adam.py): there is no Python-level code to intercept, so we cannot modify it this way. A solution is to re-implement an optimizer like Adam ourselves, exposing the iteration process so that the grafting method can be applied to it.
Usage Examples
The 6 optimizer variants implemented uniformly based on the above ideas are placed in my bert4keras project: bert4keras.optimizers.
All functions perform the correct imports depending on whether you are using Keras or tf.keras, so the usage is identical in both cases. The module also ships its own Adam implementation, written specifically for tf.keras: under tf.keras, the above variants can only be applied to the optimizers provided by bert4keras (currently only Adam), not to the built-in tf.keras optimizers.
Example Code:
from bert4keras.optimizers import *
# Transform into Adam with weight decay
AdamW = extend_with_weight_decay(Adam, 'AdamW')
optimizer = AdamW(learning_rate=0.001, weight_decay_rate=0.01)
# Transform into Adam with piecewise linear learning rate
AdamLR = extend_with_piecewise_linear_lr(Adam, 'AdamLR')
# Implement warmup: learning rate increases from 0 to 0.001 over the first 1000 steps
optimizer = AdamLR(learning_rate=0.001, lr_schedule={1000: 1.})
# Transform into Adam with gradient accumulation
AdamGA = extend_with_gradient_accumulation(Adam, 'AdamGA')
optimizer = AdamGA(learning_rate=0.001, grad_accum_steps=10)
# Combination usage
AdamWLR = extend_with_piecewise_linear_lr(AdamW, 'AdamWLR')
# Optimizer with both weight decay and warmup
optimizer = AdamWLR(learning_rate=0.001,
                    weight_decay_rate=0.01,
                    lr_schedule={1000: 1.})
(Note: Implementing so many optimizers while trying to remain compatible with both Keras and tf.keras means there may inevitably be errors or omissions. If any are found, I would greatly appreciate your corrections.)
Closing Remarks
Alchemy is not easy; cherish it as you go.