By 苏剑林 | September 08, 2018
Following up on my previous post "Make Keras Cooler!": Ingenious Layers and Fancy Callbacks...
Today, we will look at a niche requirement: custom optimizers.
Upon reflection, regardless of the framework used, the need for a custom optimizer is truly a niche among niches. Generally, for most tasks, we can brainlessly use Adam, while "alchemist" tuning experts often use SGD to achieve better results. In other words, whether you are a novice or an expert, there is rarely a need to define a custom optimizer.
So, what is the value of this article? In some scenarios, it can be quite useful. For instance, by learning how to write optimizers in Keras, you can gain a deeper understanding of algorithms like gradient descent, and you can see firsthand how concise and elegant the Keras source code is. Additionally, sometimes we can implement specific features through custom optimizers, such as rewriting optimizers for simple models (like Word2Vec) to use hard-coded gradients instead of automatic differentiation for speed, or implementing features like "soft batching."
First, let's look at the built-in optimizer code in Keras, located at:
https://github.com/keras-team/keras/blob/master/keras/optimizers.py
For simplicity, let's start with SGD. Keras's built-in SGD integrates momentum, Nesterov momentum, and learning-rate decay, which is convenient for use but gets in the way of learning, so I have stripped it down here to an example of pure SGD:
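A minimal sketch of such a pure SGD optimizer, assuming the Keras 2.x Optimizer interface (the class name MySGD is arbitrary):

```python
import keras.backend as K
from keras.optimizers import Optimizer

class MySGD(Optimizer):
    """Bare-bones SGD: p = p - lr * g, and nothing else."""
    def __init__(self, lr=0.01, **kwargs):
        super(MySGD, self).__init__(**kwargs)
        with K.name_scope(self.__class__.__name__):
            self.iterations = K.variable(0, dtype='int64', name='iterations')
            self.lr = K.variable(lr, name='lr')

    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)  # gradients of the loss w.r.t. each parameter
        self.updates = [K.update_add(self.iterations, 1)]  # the assignment ops run at each step
        self.weights = [self.iterations]  # optimizer state, saved along with the model
        for p, g in zip(params, grads):
            new_p = p - self.lr * g  # plain gradient descent
            if getattr(p, 'constraint', None) is not None:
                new_p = p.constraint(new_p)  # apply any weight constraint
            self.updates.append(K.update(p, new_p))
        return self.updates

    def get_config(self):
        config = {'lr': float(K.get_value(self.lr))}
        base_config = super(MySGD, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
```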
This shouldn't need much explanation, right? Doesn't it feel remarkably simple? Defining an optimizer isn't such a highbrow task after all~
Now, let's implement a slightly more complex feature: what I call "soft batch." I'm not sure whether that's the standard name, so let's use it for now. The scenario is this: suppose you have a large model and your GPU can only handle a batch size of 16, but you want the effect of a batch size of 64. What can you do? One solution is to process one batch of 16 at a time, accumulate the gradients, and update the parameters only after every 4 batches. That is, compute gradients on every small batch, but update the parameters only once every 4 batches.
Update (July 8, 2019): The implementation below is incorrect and does not achieve the expected effect. For a corrected implementation, please see Keras Gradient Accumulation Optimizer: Trading Time for Results.
If you truly have this requirement, it can only be solved by modifying the optimizer. Based on the previous SGD example, the reference code is as follows:
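(The class and argument names below are my own, and, as the update note above says, this version conveys the idea but does not actually behave as intended.)

```python
import keras.backend as K
from keras.optimizers import Optimizer

class SoftBatchSGD(Optimizer):
    """SGD that accumulates gradients over several batches and only
    updates the parameters every steps_per_update batches."""
    def __init__(self, lr=0.01, steps_per_update=4, **kwargs):
        super(SoftBatchSGD, self).__init__(**kwargs)
        with K.name_scope(self.__class__.__name__):
            self.iterations = K.variable(0, dtype='int64', name='iterations')
            self.lr = K.variable(lr, name='lr')
        self.steps_per_update = steps_per_update  # how many batches to accumulate over

    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)  # gradients of the current small batch
        acc_grads = [K.zeros(K.int_shape(p)) for p in params]  # accumulated gradients
        self.updates = [K.update_add(self.iterations, 1)]
        self.weights = [self.iterations] + acc_grads
        # the parameters are touched only when iterations % steps_per_update == 0
        cond = K.equal(self.iterations % self.steps_per_update, 0)
        for p, g, ag in zip(params, grads, acc_grads):
            # descend along the averaged accumulated gradient
            new_p = p - self.lr * ag / float(self.steps_per_update)
            if getattr(p, 'constraint', None) is not None:
                new_p = p.constraint(new_p)
            # assign the new parameter value only when cond is True
            self.updates.append(K.switch(cond, K.update(p, new_p), p))
            # when cond is True, restart the accumulator; otherwise keep accumulating
            self.updates.append(K.switch(cond, K.update(ag, g), K.update(ag, ag + g)))
        return self.updates

    def get_config(self):
        config = {'lr': float(K.get_value(self.lr)),
                  'steps_per_update': self.steps_per_update}
        base_config = super(SoftBatchSGD, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
```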
This should also be easy to understand. Implementing momentum on top of it would be slightly more complex, but the logic is the same. The key is to introduce an extra variable to store the accumulated gradients, plus a cond to control whether to update: the work the optimizer originally did on every batch is now executed only when cond is True, using the accumulated gradients instead of the current batch's. Compared with the original SGD, the changes are minimal.
The solution above for implementing optimizers is standard, meaning it follows Keras's design specifications, which makes it straightforward. However, I once wanted to implement an optimizer that could not be achieved this way. After reading the source code, I found an "intrusive" approach. This method acts like a "plug-in" or "hook"; it can implement the functionality I need, but it is not a standard implementation. I'd like to share it with you here.
The original requirement stems from the previous article Optimizing Algorithms from a Dynamical Systems Perspective (1): From SGD to Momentum Acceleration. It pointed out that gradient descent optimizers can be viewed as Euler's method for solving systems of differential equations. This suggests that there are many methods more advanced than Euler's method for solving differential equations; can they be applied to deep learning? For example, the slightly more advanced "Heun's Method":
$$ \begin{aligned} \tilde{p}_{i+1} &= p_i + \epsilon g(p_i) \\ p_{i+1} &= p_i + \frac{1}{2}\epsilon [g(p_i) + g(\tilde{p}_{i+1})] \end{aligned} $$
Here $p$ is the parameter (vector), $g$ is the update direction prescribed by the optimization ODE (for gradient descent, the negative of the loss gradient, so the first line is exactly an ordinary gradient descent step), and $p_i$ is the value of $p$ at the $i$-th iteration. The algorithm takes two sub-steps per iteration: it first performs a standard gradient descent step ("scouting"), then averages with the gradient at the scouted point to obtain a more accurate path. Equivalently, it can be rewritten as:
$$ \begin{aligned} \tilde{p}_{i+1} &= p_i + \epsilon g(p_i) \\ p_{i+1} &= \tilde{p}_{i+1} + \frac{1}{2}\epsilon [g(\tilde{p}_{i+1}) - g(p_i)] \end{aligned} $$
This clearly shows that the second step is actually a fine-tuning of the gradient descent.
However, implementing such algorithms poses a challenge: gradients must be calculated twice, once at the parameters $p_i$ and once at $\tilde{p}_{i+1}$. In the standard optimizer definition, everything that get_updates builds is executed as a single step, corresponding to a single sess.run in the TensorFlow framework, and those familiar with TF will know that "compute gradients, update the parameters, then compute gradients again" is hard to enforce within a single sess.run. So it cannot be implemented the standard way. After studying Keras's model-training source code, I found it can be written like this:
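A rough sketch of the idea, assuming Keras 2.x (it reaches into internals such as model._feed_inputs and model.total_loss, whose names may differ across versions; metrics are omitted, and the step is split across several K.function's to pin down the execution order):

```python
import keras.backend as K

class HeunOptimizer:
    """'Intrusive' Heun optimizer: rather than subclassing Optimizer, it
    replaces model.train_function outright, so that one training step can
    consist of several K.function calls (several sess.run's) in a fixed order."""
    def __init__(self, lr):
        self.lr = lr

    def __call__(self, model):
        # the same feed list Keras uses for its own train_function
        # (Keras 2.x internals; a model with dropout/BN would also need
        # K.learning_phase() appended)
        inputs = (model._feed_inputs +
                  model._feed_targets +
                  model._feed_sample_weights)
        lr = self.lr
        loss = model.total_loss
        params = model.trainable_weights
        grads = K.gradients(loss, params)  # loss gradients at the current params
        # buffers for p_i, g(p_i) and g(p~_{i+1})
        old_params = [K.zeros(K.int_shape(p)) for p in params]
        old_grads = [K.zeros(K.int_shape(p)) for p in params]
        new_grads = [K.zeros(K.int_shape(p)) for p in params]
        # each K.function below is one sess.run; running them in sequence
        # guarantees an execution order that a single get_updates cannot
        # 1) cache p_i and g(p_i); also fetch the loss for logging
        cache = K.function(inputs, [loss], updates=[
            K.update(c, x) for c, x in
            zip(old_params + old_grads, params + grads)])
        # 2) "scouting" step: p <- p_i - lr * g(p_i)
        # (g is the loss gradient here, hence the minus sign relative to the
        # ODE notation above, where g denotes the update direction)
        scout = K.function([], [], updates=[
            K.update(p, c1 - lr * c2) for p, c1, c2 in
            zip(params, old_params, old_grads)])
        # 3) params now hold p~_{i+1}, so this caches g(p~_{i+1})
        regrad = K.function(inputs, [], updates=[
            K.update(c, g) for c, g in zip(new_grads, grads)])
        # 4) the Heun update: p <- p_i - (lr/2) * [g(p_i) + g(p~_{i+1})]
        apply_ = K.function([], [], updates=[
            K.update(p, c1 - 0.5 * lr * (c2 + c3)) for p, c1, c2, c3 in
            zip(params, old_params, old_grads, new_grads)])

        def train_function(ins):  # one fit() step = four sess.run's
            outs = cache(ins)
            scout([])
            regrad(ins)
            apply_([])
            return outs

        # fit() builds its own train_function only when it is None,
        # so assigning ours here makes fit() use it instead
        model.train_function = train_function
```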
Usage:
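Something along these lines (x_train and y_train are placeholders for your own data):

```python
# the optimizer passed to compile is a dummy: our train_function bypasses it entirely
model.compile(loss='mse', optimizer='sgd')
HeunOptimizer(0.01)(model)  # pass in the model and swap in our train_function
model.fit(x_train, y_train, epochs=10, batch_size=32)
```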
The key idea is commented in the code. Since Keras optimizers are ultimately wrapped into a train_function, we simply need to design our own train_function by referencing the Keras source code and inserting our own operations. Note that the operation defined by K.function is equivalent to a single sess.run.
Note: Similarly, algorithms like RK23 and RK45 can be implemented. Unfortunately, these optimizers do not perform well in practice~
This article discussed the very niche need of customizing optimizers, introducing the general way to write Keras optimizers as well as an "intrusive" approach. If you have such a special requirement, you can refer to these methods.
Through the study and analysis of Keras optimizers, we can further observe that the overall Keras codebase is extremely concise and elegant, making it hard to find fault with~