By 苏剑林 (Su Jianlin) | April 28, 2019
Continuing our journey to “Make Keras a Bit Cooler.”
Today we will use Keras to flexibly output arbitrary intermediate variables, seamlessly apply weight moving averaging, and finally introduce a process-safe way to write generators.
First is outputting intermediate variables. When customizing layers, we may want to inspect intermediate variables. Some of these requirements are relatively easy to implement, such as viewing the output of a specific layer—one simply needs to save the part of the model up to that layer as a new model. However, some requirements are more difficult, such as when using an Attention layer, where we might want to view the values of the Attention matrix; using the method of building a new model would be very cumbersome. This article provides a simple method to satisfy this requirement completely.
Next is weight moving average. Weight moving average is an effective method for stabilizing and accelerating model training and even improving model performance. Many large-scale models (especially GANs) almost always use weight moving average. Generally, weight moving average is part of the optimizer, so it usually requires rewriting the optimizer to implement it. This article introduces an implementation of weight moving average that can be seamlessly inserted into any Keras model without customizing the optimizer.
As for the process-safe way of writing generators: Keras uses multi-processing when reading from generators, so if the generator itself also contains multi-processing operations, exceptions may occur. This issue therefore needs to be addressed.
Outputting Intermediate Variables
This section uses a basic model as an example:
x_in = Input(shape=(784,))
x = x_in
x = Dense(512, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(num_classes, activation='softmax')(x)
model = Model(x_in, x)
We will progressively introduce how to obtain Keras intermediate variables.
As a New Model
Suppose that after the model is trained, I want to obtain the output corresponding to x = Dense(256, activation='relu')(x). In that case, when defining the model, I can save the corresponding variable first and then redefine a model:
x_in = Input(shape=(784,))
x = x_in
x = Dense(512, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(256, activation='relu')(x)
y = x
x = Dropout(0.2)(x)
x = Dense(num_classes, activation='softmax')(x)
model = Model(x_in, x)
model2 = Model(x_in, y)
After completing the training of model, you can directly use model2.predict to view the corresponding 256-dimensional output. The prerequisite for this approach is that y must be the output of a certain layer; it cannot be an arbitrary tensor.
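For example (a minimal sketch; x_test here is assumed to be preprocessed input data of shape (num_samples, 784)):
features = model2.predict(x_test)  # the 256-dimensional intermediate representations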
K.function!
Sometimes we write a more complex custom layer, a typical example being an Attention layer, and we want to inspect some of its intermediate variables, such as the Attention matrix. This becomes quite troublesome: if we wanted to use the previous method, we would have to split the original Attention layer into two separate layers because, as mentioned before, when defining a new Keras model the inputs and outputs must be inputs and outputs of Keras layers; they cannot be arbitrary tensors. Consequently, to inspect multiple intermediate variables of a layer you would have to keep splitting it into more layers, which is clearly not user-friendly.
In fact, Keras provides an ultimate solution: K.function!
Before introducing K.function, let's write a simple example:
class Normal(Layer):
    def __init__(self, **kwargs):
        super(Normal, self).__init__(**kwargs)
    def build(self, input_shape):
        self.kernel = self.add_weight(name='kernel',
                                      shape=(1,),
                                      initializer='zeros',
                                      trainable=True)
        self.built = True
    def call(self, x):
        self.x_normalized = K.l2_normalize(x, -1)
        return self.x_normalized * self.kernel
x_in = Input(shape=(784,))
x = x_in
x = Dense(512, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.2)(x)
normal = Normal()
x = normal(x)
x = Dense(num_classes, activation='softmax')(x)
model = Model(x_in, x)
In the above example, Normal defines a layer where the output is self.x_normalized * self.kernel. However, I want to obtain the value of self.x_normalized after training is complete. It is related to the input and is not the output of a layer. Thus, the previous method cannot be used, but with K.function, it is just one line of code:
fn = K.function([x_in], [normal.x_normalized])
The usage of K.function is similar to defining a new model: you need to pass in all the input tensors that normal.x_normalized depends on, but the outputs are not required to be layer outputs; any tensor is allowed! The returned fn is a callable object, so you only need to call:
fn([x_test])
to obtain the x_normalized corresponding to x_test (the result comes back as a list, with one array per output tensor)! This is much simpler and more universal than defining a new model.
In fact, K.function is one of the foundation functions of the Keras backend. It directly encapsulates the input and output operations of the backend. In other words, when using TensorFlow as the backend, fn([x_test]) is equivalent to:
sess.run(normal.x_normalized, feed_dict={x_in: x_test})
Therefore, the output of K.function allows for any tensor because it is essentially operating directly on the backend.
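As a further sketch of this flexibility (purely illustrative, reusing the names defined above), a single K.function can return several tensors at once, mixing layer outputs and arbitrary intermediate tensors:
fn = K.function([x_in], [normal.x_normalized, model.output])
x_normalized_value, y_pred = fn([x_test])  # one NumPy array per requested output tensor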
Weight Moving Average
Weight moving average is an effective way to stabilize training: at almost zero additional cost, it can even improve model performance. It usually refers to an "Exponential Moving Average" (EMA for short), because the averaging weights typically decay exponentially. It has been adopted by mainstream models, especially GANs; in many GAN papers we typically see descriptions like:
we use an exponential moving average with decay 0.999 over the weight ...
This means the GAN model uses EMA. Furthermore, ordinary models use it too; for example, "QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension" used EMA during the training process with a decay rate of 0.9999.
Format of Moving Average
The format of weight moving average is actually very simple: assume each update of the optimizer is:
\begin{equation}\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n - \Delta \boldsymbol{\theta}_n \end{equation}
where $\Delta \boldsymbol{\theta}_n$ is the update produced by the optimizer, which can come from any optimizer, such as SGD or Adam. Weight moving average, on the other hand, maintains a new set of variables $\boldsymbol{\Theta}$:
\begin{equation}\boldsymbol{\Theta}_{n+1} = \alpha \boldsymbol{\Theta}_n + (1-\alpha) \boldsymbol{\theta}_{n+1}\end{equation}
where $\alpha$ is a positive constant close to 1, called the "decay rate."
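Unrolling this recursion shows that $\boldsymbol{\Theta}$ is indeed a weighted average over the whole optimization trajectory, with exponentially decaying weight on older iterates:
\begin{equation}\boldsymbol{\Theta}_{n+1} = (1-\alpha)\sum_{k=0}^{n} \alpha^{k}\,\boldsymbol{\theta}_{n+1-k} + \alpha^{n+1}\boldsymbol{\Theta}_0\end{equation}
The coefficients sum to 1, so $\boldsymbol{\Theta}$ stays on the same scale as $\boldsymbol{\theta}$.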
Weight moving average is also called Polyak averaging. Note that although it is somewhat similar in form, it is not the same as momentum acceleration: EMA does not change the trajectory of the original optimizer. Wherever the original optimizer went, it still goes the same way, but it maintains a set of new variables that average the trajectory of the original optimizer; momentum acceleration, however, changes the trajectory of the original optimizer.
To re-emphasize, weight moving average does not change the direction of the optimizer; it simply takes the points on the optimization trajectory, averages them, and uses that as the final model weights.
Regarding the principle and effectiveness of weight moving average, you can further refer to the article "Optimization Algorithms from a Dynamical Perspective (IV): The Third Stage of GANs".
Clever Injection Implementation
The key to implementing EMA is how to introduce a set of average variables based on the original optimizer and execute the update of average variables after each parameter update. This requires a certain understanding of the Keras source code and its implementation logic.
The reference implementation provided here is as follows:
class ExponentialMovingAverage:
    """Perform an exponential moving average on the model weights.
    Usage: after model.compile and before the first training step,
    first initialize the object, then execute the inject method.
    """
    def __init__(self, model, momentum=0.9999):
        self.momentum = momentum
        self.model = model
        self.ema_weights = [K.zeros(K.shape(w)) for w in model.weights]
    def inject(self):
        """Add update operators to model.metrics_updates.
        """
        self.initialize()
        for w1, w2 in zip(self.ema_weights, self.model.weights):
            op = K.moving_average_update(w1, w2, self.momentum)
            self.model.metrics_updates.append(op)
    def initialize(self):
        """Initialize the EMA weights to match the original model's initialization.
        """
        self.old_weights = K.batch_get_value(self.model.weights)
        K.batch_set_value(zip(self.ema_weights, self.old_weights))
    def apply_ema_weights(self):
        """Back up the original model weights, then apply the averaged weights to the model.
        """
        self.old_weights = K.batch_get_value(self.model.weights)
        ema_weights = K.batch_get_value(self.ema_weights)
        K.batch_set_value(zip(self.model.weights, ema_weights))
    def reset_old_weights(self):
        """Restore the model to the backed-up (non-EMA) weights.
        """
        K.batch_set_value(zip(self.model.weights, self.old_weights))
Usage is very simple:
EMAer = ExponentialMovingAverage(model) # Execute after model.compile
EMAer.inject() # Execute after model.compile
model.fit(x_train, y_train) # Train the model
After training is complete:
EMAer.apply_ema_weights() # Apply EMA weights to the model
model.predict(x_test) # Perform prediction, verification, saving, etc.
EMAer.reset_old_weights() # Before continuing training, restore the model's old weights. Again, EMA does not affect the model's optimization trajectory.
model.fit(x_train, y_train) # Continue training
Reviewing the implementation process, the main point is the introduction of the K.moving_average_update operation and its insertion into model.metrics_updates. During the training process, the model reads and executes all operators in model.metrics_updates, thereby completing the moving average.
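For intuition, the op created by K.moving_average_update(w1, w2, momentum) has roughly the following semantics (a sketch written with K.update, not the backend's actual implementation):
op = K.update(w1, momentum * w1 + (1. - momentum) * w2)  # i.e. w1 <- momentum * w1 + (1 - momentum) * w2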
Process-Safe Generators
Generally, when training data cannot be fully loaded into memory or needs to be generated dynamically, a generator is used. Typically, the way to write a Keras model generator is:
def data_generator():
    while True:
        x_train = something
        y_train = otherthing
        yield x_train, y_train
But if something or otherthing contains multi-processing operations, problems may arise. In such cases, there are two solutions: one is to pass use_multiprocessing=False and workers=0 to fit_generator; the other is to write the generator by inheriting from the keras.utils.Sequence class.
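For the first option, the call looks roughly like this (steps_per_epoch and epochs are placeholder values):
model.fit_generator(data_generator(),
                    steps_per_epoch=1000,
                    epochs=10,
                    use_multiprocessing=False,
                    workers=0)  # run the generator in the main process only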
Official Reference Example
The keras.utils.Sequence class is introduced in the official documentation, which emphasizes:
Sequence are a safer way to do multiprocessing. This structure guarantees that the network will only train once on each sample per epoch which is not the case with generators.
In short, it is safe for multi-processing and can be used with confidence. The example provided by the official documentation is as follows:
from skimage.io import imread
from skimage.transform import resize
import numpy as np
from keras.utils import Sequence

# Here, `x_set` is a list of paths to the images
# and `y_set` are the associated classes.
class CIFAR10Sequence(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))
    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return np.array([
            resize(imread(file_name), (200, 200))
            for file_name in batch_x]), np.array(batch_y)
Simply define the __len__ and __getitem__ methods according to the format. The __getitem__ method directly returns one batch of data.
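As a complementary sketch (my own addition, not part of the official example; it assumes x_train and y_train are NumPy arrays that fit in memory), a Sequence can also reshuffle the data between epochs by implementing on_epoch_end:
import numpy as np
from keras.utils import Sequence

class ShuffledSequence(Sequence):
    def __init__(self, x, y, batch_size):
        self.x, self.y = x, y
        self.batch_size = batch_size
        self.indices = np.arange(len(self.x))
    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))
    def __getitem__(self, idx):
        # select the indices belonging to batch number `idx`
        batch_idx = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        return self.x[batch_idx], self.y[batch_idx]
    def on_epoch_end(self):
        # called by Keras at the end of every epoch
        np.random.shuffle(self.indices)
An instance of such a class is passed to fit_generator directly, in place of a plain generator.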
bert-as-service Example
I first discovered the necessity of Sequence while experimenting with bert-as-service. bert-as-service is a service component developed by Xiao Han for quickly obtaining BERT encoding vectors. I once tried to use it to obtain character vectors and pass them into Keras for training, but found that training would always get stuck.
After searching, it was confirmed to be a conflict between the multi-processing of Keras's fit_generator and the built-in multi-processing of bert-as-service. The specifics of the conflict are somewhat vague to me, so I won't delve into it. However, a reference solution provided here uses a generator written by inheriting from the Sequence class.
(P.S.: For calling bert as service, later Xiao Han provided a coroutine version ConcurrentBertClient, which can replace the original BertClient, so there will be no problems even with the original generator.)
Keras as a Breath of Fresh Air
In my eyes, Keras is a breath of fresh air among deep learning frameworks, much like Python is a breath of fresh air among programming languages. Implementing what I need with Keras is an enjoyable experience every single time.