By 苏剑林 | July 30, 2019
Recently I implemented two optimizers in Keras. Both involve some interesting implementation tricks, so I decided to write a short article introducing them (had there been only one, I might not have bothered). The two optimizers have rather amusing names: one is "Lookahead" and the other is "Lazy". Do they represent completely different optimization strategies? Not really; it's more that their inventors were creative with naming.
Lookahead
First up is the Lookahead optimizer, which comes from the paper "Lookahead Optimizer: k steps forward, 1 step back". It is a recently proposed optimizer whose author list features big names such as Geoffrey Hinton and Jimmy Ba (one of the authors of Adam). With the endorsement of these two heavyweights, it has attracted considerable attention.
The idea behind Lookahead is very simple. Strictly speaking, it is not a standalone optimizer but a strategy for using existing optimizers. In short, it loops over the following three steps (a minimal sketch follows the list):
1. Back up the current model weights $\theta$;
2. Starting from $\theta$, update for $k$ steps using a specified optimizer to obtain new weights $\tilde{\theta}$;
3. Update the model weights as $\theta \leftarrow \theta + \alpha\left(\tilde{\theta} - \theta\right)$.
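To make the loop concrete, here is a minimal NumPy sketch of the strategy. The names lookahead_train and inner_step are made up for illustration; this is not the repository code, and inner_step stands in for one update of whatever inner optimizer you choose (e.g. Adam):

import numpy as np

def lookahead_train(theta, inner_step, k=5, alpha=0.5, n_cycles=100):
    for _ in range(n_cycles):
        slow = theta.copy()                    # 1. back up the current weights
        for _ in range(k):                     # 2. k steps with the inner optimizer
            theta = inner_step(theta)
        theta = slow + alpha * (theta - slow)  # 3. theta <- theta + alpha * (theta_tilde - theta)
    return theta

# Toy usage: the "inner optimizer" is plain gradient descent on f(theta) = ||theta||^2
theta = lookahead_train(np.array([3.0, -2.0]), inner_step=lambda t: t - 0.2 * t)
print(theta)  # approaches the minimum at [0, 0]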
Below is my Keras implementation. The implementation style was described in an earlier article, "Making Keras even Cooler: Niche Custom Optimizers"; it follows an "intrusive" style of writing:
https://github.com/bojone/keras_lookahead
Usage is very simple:
from keras_lookahead import Lookahead
from keras.optimizers import Adam
from keras.models import Sequential
from keras.layers import Dense

# A minimal model, just for demonstration
model = Sequential([Dense(1, input_shape=(10,))])

# Wrap the base optimizer with Lookahead: k inner steps, interpolation factor alpha
optimizer = Lookahead(Adam(1e-3), k=5, alpha=0.5)
model.compile(optimizer=optimizer, loss='mse')
Regarding its effectiveness, the original paper reports several experiments: some show slight improvements (on CIFAR-10 and CIFAR-100), others more significant gains (LSTM language modeling). In a simple experiment of my own, the results were virtually unchanged. I have always felt that optimizers are somewhat mysterious things: sometimes only SGD reaches the optimum, and other times only Adam converges. In short, one shouldn't expect that simply switching optimizers will drastically improve a model's performance; Lookahead simply gives us one more option. Readers with training time to spare should feel free to try it.
Appendix: "Machine Heart's Introduction to Lookahead" (Chinese)
LazyOptimizer
The LazyOptimizer is essentially designed for NLP, or more precisely, for Embedding layers.
The starting point of LazyOptimizer is the observation that all optimizers with momentum (which naturally includes Adam and SGD with momentum) share a problem: words (tokens) that are not sampled in the current batch are still updated by historical momentum, which can cause the Embedding layer to overfit (see the Zhihu discussion). Specifically, once a word has been sampled, the gradient of its Embedding row is nonzero, and this gradient is recorded in the momentum. In subsequent batches, even if the word is not sampled again, its Embedding gradient is zero but its momentum is not, so the row is still updated. As a result, even tokens that are not repeatedly sampled have their Embeddings updated repeatedly, which leads to overfitting.
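To see the problem numerically, here is a tiny sketch using plain momentum for simplicity (the numbers are made up for illustration):

beta, lr = 0.9, 1e-3   # momentum decay and learning rate (illustrative values)
m = 0.0                # momentum of one Embedding component

m = beta * m + (1 - beta) * 1.0   # batch 1: the token is sampled, gradient = 1.0
print(lr * m)                     # nonzero update, as expected

m = beta * m + (1 - beta) * 0.0   # batch 2: the token is NOT sampled, gradient = 0
print(lr * m)                     # the update is still nonzero, driven purely by momentum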
Therefore, an improved plan is to update only when the word has been sampled. This is the basic principle of LazyOptimizer.
In terms of implementation, how do we determine whether a word has been sampled? The most thorough method would be to pass in the indices of the sampled words, but that is not very user-friendly. I used an approximation instead: check whether the gradient of the word's Embedding row is zero. If it is zero, the word was "probably" not sampled in the current batch. The rationale is that if the word was not sampled, its gradient is certainly zero; if it was sampled, the probability of its gradient being exactly zero is vanishingly small (the row has so many components), so this implementation suffices.
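To make the principle concrete, here is a minimal NumPy sketch of the masking idea (the name lazy_momentum_update is hypothetical; the real implementation wraps the base optimizer's Keras update ops). Note that, matching the note at the end of this section, the momentum here is still updated for every row and only the weight update is masked:

import numpy as np

def lazy_momentum_update(emb, grad, m, lr=1e-3, beta=0.9):
    m = beta * m + (1 - beta) * grad                    # momentum is updated for all rows
    sampled = np.any(grad != 0, axis=1, keepdims=True)  # rows with a nonzero gradient were "sampled"
    emb = emb - lr * m * sampled                        # only sampled rows are actually updated
    return emb, m

# Toy usage: 5 tokens with dim-3 Embeddings, only token 2 appears in the batch
emb, m = np.ones((5, 3)), np.zeros((5, 3))
grad = np.zeros((5, 3)); grad[2] = 1.0
emb, m = lazy_momentum_update(emb, grad, m)
print(emb)  # only row 2 has moved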
My Keras implementation is located at:
https://github.com/bojone/keras_lazyoptimizer
The usage is also simple: wrap an optimizer that has momentum and pass in all the Embedding layers, making it a new "Lazy" version of the optimizer:
from keras_lazyoptimizer import LazyOptimizer
from keras.optimizers import Adam
from keras.models import Sequential
from keras.layers import Embedding, GlobalAveragePooling1D, Dense

# A minimal model with an Embedding layer, just for demonstration
model = Sequential([
    Embedding(10000, 128, input_length=100),
    GlobalAveragePooling1D(),
    Dense(1, activation='sigmoid')
])

# Collect all Embedding layers and wrap the optimizer
embedding_layers = [model.layers[0]]
optimizer = LazyOptimizer(Adam(1e-3), embedding_layers=embedding_layers)
model.compile(optimizer=optimizer, loss='binary_crossentropy')
The GitHub repository also includes an IMDB example. In this example, using Adam(1e-3) directly as the optimizer gives a best validation accuracy of around 83.7%, while LazyOptimizer(Adam(1e-3), embedding_layers) consistently reaches above 84.9%. The effect is quite noticeable. In general, I think any model with a very large Embedding layer (especially word-level models) is worth trying it on: because the Embedding layer accounts for so many parameters, reducing its update frequency lets the model focus its optimization effort on the remaining parts.
Note: this LazyOptimizer differs slightly from the standard lazy optimizers. In the standard version, for tokens that are not sampled, all the associated cached quantities (momentum, etc.) are likewise left un-updated. In my implementation, even when a token is not sampled, its cached quantities are still updated; some evaluations suggest this approach actually yields better results.
Summary
There isn't much more to say. I have implemented two optimizers in Keras, so that Keras users can try out new techniques early, or simply make their Keras experience a bit more interesting.