By 苏剑林 | March 10, 2019
Raising the banner of "Make Keras Cooler!" to unlock the infinite possibilities of Keras~
Today, we will accomplish two very important things with Keras: setting layer-wise learning rates and flexibly manipulating gradients.
First is layer-wise learning rates. The utility of this is obvious—for example, when fine-tuning an existing model, sometimes we want to freeze certain layers, but other times we don't want to freeze them entirely; instead, we want them to update with a lower learning rate than other layers. This requirement leads us to layer-wise learning rates. There has been some discussion online about implementing this in Keras, but the conclusions usually suggest rewriting the optimizer. Clearly, that approach is unfriendly in terms of both implementation and usage.
Next is gradient manipulation. A direct example is gradient clipping, which keeps gradients within a certain range; Keras has this built in, but only as global clipping. What if I want a different clipping method for each gradient? Or what if I have other ideas for manipulating gradients? Do I have to rewrite the optimizer yet again?
This article aims to provide the simplest possible solutions to these problems.
Layer-wise Learning Rates
While rewriting the optimizer to set layer-wise learning rates is feasible, it is too much trouble. To seek a simpler solution, we need some mathematical knowledge to guide us.
Optimization Under Parameter Transformation
First, let's consider the update formula for Stochastic Gradient Descent (SGD):
\begin{equation}\boldsymbol{\theta}_{n+1}=\boldsymbol{\theta}_{n}-\alpha \frac{\partial L(\boldsymbol{\theta}_{n})}{\partial \boldsymbol{\theta}_n}\label{eq:sgd-1}\end{equation}
where $L$ is the loss function with parameters $\boldsymbol{\theta}$, $\alpha$ is the learning rate, and $\frac{\partial L(\boldsymbol{\theta}_{n})}{\partial \boldsymbol{\theta}_n}$ is the gradient, sometimes written as $\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}_{n})$. Notation is flexible; the key is understanding its meaning.
Now, let's consider the transformation $\boldsymbol{\theta}=\lambda \boldsymbol{\phi}$, where $\lambda$ is a fixed scalar and $\boldsymbol{\phi}$ is the parameter. If we optimize $\boldsymbol{\phi}$ instead, the corresponding update formula is:
\begin{equation}\begin{aligned}\boldsymbol{\phi}_{n+1}=&\boldsymbol{\phi}_{n}-\alpha \frac{\partial L(\lambda\boldsymbol{\phi}_{n})}{\partial \boldsymbol{\phi}_n}\\
=&\boldsymbol{\phi}_{n}-\alpha \frac{\partial L(\boldsymbol{\theta}_{n})}{\partial \boldsymbol{\theta}_n}\frac{\partial \boldsymbol{\theta}_{n}}{\partial \boldsymbol{\phi}_n}\\
=&\boldsymbol{\phi}_{n}-\lambda\alpha \frac{\partial L(\boldsymbol{\theta}_{n})}{\partial \boldsymbol{\theta}_n}\end{aligned}\end{equation}
The second equality is simply the chain rule. Now, if we multiply both sides by $\lambda$, we get:
\begin{equation}\lambda\boldsymbol{\phi}_{n+1}=\lambda\boldsymbol{\phi}_{n}-\lambda^2\alpha \frac{\partial L(\boldsymbol{\theta}_{n})}{\partial \boldsymbol{\theta}_n}\quad\Rightarrow\quad\boldsymbol{\theta}_{n+1}=\boldsymbol{\theta}_{n}-\lambda^2\alpha \frac{\partial L(\boldsymbol{\theta}_{n})}{\partial \boldsymbol{\theta}_n}\label{eq:sgd-2}\end{equation}
Comparing $\eqref{eq:sgd-1}$ and $\eqref{eq:sgd-2}$, do you see what I'm getting at?
In an SGD optimizer, if we perform the parameter transformation $\boldsymbol{\theta}=\lambda \boldsymbol{\phi}$, the equivalent result is that the learning rate changes from $\alpha$ to $\lambda^2\alpha$.
However, for adaptive learning rate optimizers (such as RMSprop, Adam, etc.), the situation is slightly different because the adaptive learning rate uses the gradient (in the denominator) to adjust the step size, which cancels out one $\lambda$. Thus (interested readers are encouraged to derive this themselves):
In adaptive learning rate optimizers like RMSprop and Adam, if we perform the parameter transformation $\boldsymbol{\theta}=\lambda \boldsymbol{\phi}$, the equivalent result is that the learning rate changes from $\alpha$ to $\lambda\alpha$.
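As a quick sanity check, take an RMSprop-style update, writing $\boldsymbol{g}_n$ for the gradient with respect to $\boldsymbol{\theta}_n$ and $\boldsymbol{v}_n$ for its accumulated squared gradient (assuming $\lambda > 0$ and ignoring the smoothing term). The gradient with respect to $\boldsymbol{\phi}_n$ is $\lambda\boldsymbol{g}_n$, so its accumulated square is $\lambda^2\boldsymbol{v}_n$, and
\begin{equation}\boldsymbol{\phi}_{n+1}=\boldsymbol{\phi}_{n}-\alpha \frac{\lambda\boldsymbol{g}_n}{\sqrt{\lambda^2\boldsymbol{v}_n}}=\boldsymbol{\phi}_{n}-\alpha \frac{\boldsymbol{g}_n}{\sqrt{\boldsymbol{v}_n}}\quad\Rightarrow\quad\boldsymbol{\theta}_{n+1}=\boldsymbol{\theta}_{n}-\lambda\alpha \frac{\boldsymbol{g}_n}{\sqrt{\boldsymbol{v}_n}}\end{equation}
The $\lambda$ from the chain rule cancels against the $\lambda$ in the denominator, leaving an effective learning rate of $\lambda\alpha$.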
Adjusting Learning Rates via "Grafting"
With these two conclusions, we only need to find a way to implement the parameter transformation without rewriting the optimizer to achieve layer-wise learning rates.
Implementing parameter transformation isn't difficult. We previously discussed a method in the Weight Normalization section of "Make Keras Cooler!": Arbitrary Outputs and Flexible Normalization. Since Keras separates the build and call steps when constructing a layer, we can insert operations after build and before call is invoked.
Below is an encapsulated implementation:
import keras.backend as K


class SetLearningRate:
    """A wrapper for layers, used to set the learning rate of the wrapped layer.
    """
    def __init__(self, layer, lamb, is_ada=False):
        self.layer = layer
        self.lamb = lamb  # learning rate multiplier
        self.is_ada = is_ada  # whether an adaptive learning rate optimizer is used

    def __call__(self, inputs):
        with K.name_scope(self.layer.name):
            if not self.layer.built:
                input_shape = K.int_shape(inputs)
                self.layer.build(input_shape)
                self.layer.built = True
                if self.layer._initial_weights is not None:
                    self.layer.set_weights(self.layer._initial_weights)
        for key in ['kernel', 'bias', 'embeddings', 'depthwise_kernel',
                    'pointwise_kernel', 'recurrent_kernel', 'gamma', 'beta']:
            if hasattr(self.layer, key):
                weight = getattr(self.layer, key)
                if self.is_ada:
                    lamb = self.lamb  # adaptive optimizers keep the lamb ratio directly
                else:
                    lamb = self.lamb**0.5  # for SGD (including momentum), take the square root of lamb
                K.set_value(weight, K.eval(weight) / lamb)  # rescale the initialization
                setattr(self.layer, key, weight * lamb)  # replace the weight with a scaled version
        return self.layer(inputs)
Usage example:
from keras.layers import Input, Embedding, LSTM
from keras.models import Model

x_in = Input(shape=(None,))
x = x_in

# Normally we would write: x = Embedding(100, 1000, weights=[word_vecs])(x)
# The line below says: an adaptive learning rate optimizer will be used later,
# and the Embedding layer should update at 1/10 of the global learning rate.
# word_vecs are pre-trained word vectors.
x = SetLearningRate(Embedding(100, 1000, weights=[word_vecs]), 0.1, True)(x)

# Imagine the rest of the model here...
x = LSTM(100)(x)

model = Model(x_in, x)
model.compile(loss='mse', optimizer='adam')  # optimized with an adaptive optimizer
A few points to note:
1. Currently, this method can only be used when building the model from scratch; it cannot be applied to an already built model.
2. If there are pre-trained weights, there are two ways to load them. The first is to pass them via the weights parameter when defining the layer (as in the example). The second is to call model.set_weights(weights) after the model is built (with SetLearningRate already inserted), where weights are the original pre-trained weights "already divided by $\lambda$ or $\sqrt{\lambda}$ at the SetLearningRate positions" (see the sketch after this list).
3. The second method for loading weights might sound confusing, but if you understand the theory in this section, you should know what I mean. Since the learning rate adjustment is achieved via weight * lamb, the initialization of the weight must become weight / lamb.
4. This operation is essentially irreversible. For example, if you initially set the Embedding layer to update at 1/10th of the global learning rate, it is very difficult to change it to 1/5th or another ratio later. (Of course, if you truly master the principles and the weight loading logic, you could figure it out, but by then you'd likely have implemented your own solution anyway).
5. These limitations exist because we want to avoid modifying or rewriting the optimizer. If you decide to modify the optimizer yourself, please refer to "Make Keras Cooler!": Niche Custom Optimizers.
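To make point 2 concrete, here is a minimal sketch of the second loading method, under assumed names: pretrained is the original Embedding matrix (a NumPy array), and we assume it sits at index 0 of model.get_weights(). With an adaptive optimizer the divisor is $\lambda$; for SGD it would be $\sqrt{\lambda}$:

lamb = 0.1  # the same multiplier that was passed to SetLearningRate
weights = model.get_weights()
weights[0] = pretrained / lamb  # divide by lamb (adaptive) or lamb**0.5 (SGD)
model.set_weights(weights)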
Free Gradient Manipulation
In this section, we will learn more flexible control over gradients. This involves modifying the optimizer, but it does not require rewriting it entirely.
Structure of Keras Optimizers
To modify an optimizer, one must first understand its structure. We took a brief look in "Make Keras Cooler!": Niche Custom Optimizers; let's revisit it now.
The code for Keras optimizers is at: https://github.com/keras-team/keras/blob/master/keras/optimizers.py
Examining any optimizer reveals that to customize one, you only need to inherit the Optimizer class and define the get_updates method. However, we don't want to create a new optimizer; we just want to control the gradients. We can see that gradient acquisition actually happens in the get_gradients method of the parent Optimizer class:
def get_gradients(self, loss, params):
    grads = K.gradients(loss, params)
    if None in grads:
        raise ValueError('An operation has `None` for gradient. '
                         'Please make sure that all of your ops have a '
                         'gradient defined (i.e. are differentiable). '
                         'Common ops without gradient: '
                         'K.argmax, K.round, K.eval.')
    if hasattr(self, 'clipnorm') and self.clipnorm > 0:
        norm = K.sqrt(sum([K.sum(K.square(g)) for g in grads]))
        grads = [clip_norm(g, self.clipnorm, norm) for g in grads]
    if hasattr(self, 'clipvalue') and self.clipvalue > 0:
        grads = [K.clip(g, -self.clipvalue, self.clipvalue) for g in grads]
    return grads
The first line of the method acquires the raw gradients, and the subsequent code provides two types of gradient clipping. It's not hard to imagine that by overriding the get_gradients method, we can perform any operation on the gradients without affecting the update steps of the optimizer (i.e., without affecting get_updates).
Everything is an Object: Just Overwrite It
How can we modify only the get_gradients method? This is thanks to the Python philosophy—"Everything is an object." Python is an object-oriented language where almost every variable you encounter is an object. While we say get_gradients is a method of the optimizer, it is also an attribute (an object) of the optimizer. Since it is an attribute, we can simply overwrite it.
Let's look at a very "brute-force" example (a prank):
from keras.optimizers import Adam

def our_get_gradients(loss, params):
    return [K.zeros_like(p) for p in params]

adam_opt = Adam(1e-3)
adam_opt.get_gradients = our_get_gradients

model.compile(loss='categorical_crossentropy',
              optimizer=adam_opt)
This example is actually quite boring—it sets all gradients to zero (meaning the model won't move no matter how much you optimize it). However, this "prank" is representative: if you can set all gradients to zero, you can also perform any operation you like. For instance, you could clip gradients according to the $L_1$ norm instead of the $L_2$ norm, or make other adjustments.
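For instance, here is a minimal sketch of per-gradient $L_1$-norm clipping, reusing the imports from the previous snippet; the function name and the threshold clip_l1 are made up for this example, and only standard backend ops are used:

def l1_clipped_gradients(loss, params, clip_l1=1.0):
    # raw gradients, exactly as in the first line of the original get_gradients
    grads = K.gradients(loss, params)
    # shrink each gradient whenever its own L1 norm exceeds clip_l1
    return [g * clip_l1 / K.maximum(K.sum(K.abs(g)), clip_l1) for g in grads]

adam_opt = Adam(1e-3)
adam_opt.get_gradients = l1_clipped_gradients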
What if I only want to manipulate the gradients of specific layers? That is simple too: give those layers distinguishable names when defining them, then branch on the names in params, as in the sketch below. Once you reach this step, mastering one method unlocks them all.
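As a sketch (assuming you named the target layer so that its weight names contain 'embedding'; the 0.1 factor is arbitrary):

def selective_gradients(loss, params):
    grads = K.gradients(loss, params)
    # shrink only the gradients of weights whose name matches;
    # all other gradients pass through untouched
    return [g * 0.1 if 'embedding' in p.name else g
            for p, g in zip(params, grads)]

adam_opt = Adam(1e-3)
adam_opt.get_gradients = selective_gradients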
The Elegance of Keras
Perhaps in the eyes of many, Keras is just a user-friendly but "rigidly" encapsulated high-level framework. But in my eyes, I see only its infinite flexibility.
It is an impeccable encapsulation.