"Make Keras a Little Cooler!": Layers-within-layers and Masking

By 苏剑林 | July 16, 2019

This edition of "Make Keras a Little Cooler!" shares two topics with readers: the first is "layers-within-layers," which, as the name suggests, means reusing existing layers when customizing layers in Keras, greatly reducing code volume; the second, requested by several readers, is an introduction to the principles and methods of masking in sequence models.

Layers-within-layers

In the article “Make Keras a Little Cooler!”: Exquisite Layers and Fancy Callbacks, we already introduced the basic methods for customizing Keras layers. The core steps are defining the build and call functions, where build is responsible for creating trainable weights and call defines the specific operations.

Avoid Redundant Labor

Readers who frequently use custom layers may feel they are doing redundant work. For example, if we want to add a linear transformation, we must add kernel and bias variables in build (manually defining initialization, regularization, etc.), then use K.dot in call, and sometimes consider dimension alignment issues. This process is tedious. In fact, a linear transformation is simply a Dense layer without an activation function. If we could reuse existing layers while customizing a layer, it would obviously save a lot of code.

As long as you are familiar with Python's object-oriented programming and carefully study the source code of Keras's Layer, it is not difficult to find a way to reuse existing layers. Here, I have organized it into a standardized workflow for readers to reference.

(Note: Starting from Keras 2.3.0, the layers-within-layers functionality is built-in; you can simply use Layer directly without the custom OurLayer below.)
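On those newer versions, sublayers assigned as attributes of a custom layer are tracked automatically, so a two-Dense composition can be written roughly as below. This is only a minimal sketch under that assumption; the class and argument names are illustrative, not from any library:

from keras.layers import Layer, Dense

class NestedDense(Layer):
    """Relies on Keras >= 2.3.0 tracking sublayer weights automatically."""
    def __init__(self, hidden_dim, output_dim, **kwargs):
        super(NestedDense, self).__init__(**kwargs)
        self.h_dense = Dense(hidden_dim, activation='relu')
        self.o_dense = Dense(output_dim)
    def call(self, inputs):
        return self.o_dense(self.h_dense(inputs))
    def compute_output_shape(self, input_shape):
        return input_shape[:-1] + (self.o_dense.units,)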

OurLayer

First, we define a new OurLayer class:

from keras.layers import Layer, Dense
from keras import backend as K


class OurLayer(Layer):
    """Reuse existing layers inside a custom layer.
    """
    def reuse(self, layer, *args, **kwargs):
        # Build the inner layer manually if it hasn't been built yet
        if not layer.built:
            inputs = args[0] if len(args) > 0 else kwargs['inputs']
            if isinstance(inputs, list):
                input_shape = [K.int_shape(x) for x in inputs]
            else:
                input_shape = K.int_shape(inputs)
            layer.build(input_shape)
        # Run the inner layer's computation directly
        outputs = layer.call(*args, **kwargs)
        # Manually register the inner layer's weights with this layer,
        # so they are captured for training and saving
        for w in layer.trainable_weights:
            if w not in self._trainable_weights:
                self._trainable_weights.append(w)
        for w in layer.non_trainable_weights:
            if w not in self._non_trainable_weights:
                self._non_trainable_weights.append(w)
        # Also collect updates (needed by layers like BatchNormalization)
        for u in layer.updates:
            if not hasattr(self, '_updates'):
                self._updates = []
            if u not in self._updates:
                self._updates.append(u)
        return outputs

This OurLayer class inherits from the original Layer and adds a reuse method, allowing us to reuse existing layers.

Below is a simple example defining a layer with the following operation:

$$y = g(f(xW_1 + b_1)W_2 + b_2)$$

Here $f, g$ are activation functions. This is essentially a composition of two Dense layers. If we followed the standard approach, we would have to define several weights in build, determine shapes based on input, define initializations, etc. However, all these are already defined in the Dense layer. We can simply call them as follows:

class OurDense(OurLayer):
    """Inherit from OurLayer instead of Layer
    """
    def __init__(self, hidden_dim, output_dim,
                 hidden_activation='linear',
                 output_activation='linear', **kwargs):
        super(OurDense, self).__init__(**kwargs)
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        self.hidden_activation = hidden_activation
        self.output_activation = output_activation
    def build(self, input_shape):
        """Add the layers to be reused in the build method.
        Standard trainable weights can also be added here.
        """
        super(OurDense, self).build(input_shape)
        self.h_dense = Dense(self.hidden_dim, 
                             activation=self.hidden_activation)
        self.o_dense = Dense(self.output_dim, 
                             activation=self.output_activation)
    def call(self, inputs):
        """Simply reuse the layers; equivalent to o_dense(h_dense(inputs))
        """
        h = self.reuse(self.h_dense, inputs)
        o = self.reuse(self.o_dense, h)
        return o
    def compute_output_shape(self, input_shape):
        return input_shape[:-1] + (self.output_dim,)

Isn't that much cleaner?
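As a quick usage check, here is a sketch of OurDense in a functional-API model; the dimensions below are arbitrary placeholders:

from keras.layers import Input
from keras.models import Model

x_in = Input(shape=(784,))
y = OurDense(256, 10, hidden_activation='relu',
             output_activation='softmax')(x_in)
model = Model(x_in, y)
# the weights of both inner Dense layers are registered as trainable
# weights of OurDense, so they are trained and saved as usual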

Mask

In this section, we discuss the issues of padding and masking when processing variable-length sequences.

Prove You've Thought About It

Recently, in several models I've open-sourced, I've used masking extensively, which seemed to confuse many readers. While it's perfectly natural to have questions about new concepts, asking without thinking can be irresponsible. I believe that when asking someone a question, you should "prove" you have thought about it first. For example, if you want me to explain masking, I would first ask you to answer:

What does the sequence look like before masking? Which positions in the sequence changed after masking? How did they change?

These three questions have nothing to do with the "theory" of masking; they just require you to observe the operations being performed. Only after seeing what is happening can we discuss why we do it. If one cannot even understand the operation itself, there are two choices: give up on the problem, or study Keras for a few more months before coming back to it.

Assuming the reader has understood the masking operation, let's briefly discuss its basic principles.

Eliminating Padding

Masking appears alongside padding because neural networks require a regular, fixed-shape tensor as input, while text sequences usually vary in length. Consequently, we need to truncate or pad them to a fixed length. By convention, we use 0 as the padding symbol.
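In Keras this padding step is typically done with pad_sequences; a small sketch (the sequences are made up for illustration):

from keras.preprocessing.sequence import pad_sequences

seqs = [[1, 0, 3, 4, 5], [2, 7]]
# pad with 0 at the end so every sequence has length 8
padded = pad_sequences(seqs, maxlen=8, padding='post', value=0)
# padded[0] -> [1, 0, 3, 4, 5, 0, 0, 0]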

Let's use a simple vector to describe the principle. Suppose we have a vector of length 5:

$$x = [1, 0, 3, 4, 5]$$

After padding to length 8, it becomes:

$$x = [1, 0, 3, 4, 5, 0, 0, 0]$$

When you feed this length-8 vector into a model, the model doesn't know whether this is a "length-8 vector" or a "length-5 vector filled with three meaningless zeros." To indicate which parts are meaningful and which are padding, we need a mask vector (matrix):

$$m = [1, 1, 1, 1, 1, 0, 0, 0]$$

This is a 0/1 vector (matrix), where 1 represents a meaningful part and 0 represents a meaningless padding part.

Masking refers to operations involving $x$ and $m$ to eliminate the effects of padding. For instance, if we want to calculate the mean of $x$, the expected result is:

$$\text{avg}(x) = \frac{1 + 0 + 3 + 4 + 5}{5} = 2.6$$

However, because the vector is padded, a direct calculation would yield:

$$\frac{1 + 0 + 3 + 4 + 5 + 0 + 0 + 0}{8} = 1.625$$

This introduces a bias. More seriously, the number of zeros might vary for the same input across different paddings, leading to different means for the same sample, which is unreasonable. With the mask vector $m$, we can rewrite the mean operation as:

$$\text{avg}(x) = \frac{\text{sum}(x \otimes m)}{\text{sum}(m)}$$

Here $\otimes$ denotes element-wise multiplication. In this way, the numerator only sums the non-padded parts, and the denominator counts the non-padded parts. No matter how many zeros you pad, the final result remains consistent.
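Expressed with the Keras backend, a masked mean over the sequence dimension could look like the sketch below, where x has shape $[batch\_size, seq\_len, dim]$ and mask has shape $[batch\_size, seq\_len, 1]$ (the convention used later in this post):

from keras import backend as K
from keras.layers import Lambda

def masked_mean(inputs):
    x, mask = inputs
    # numerator sums only the non-padded steps; denominator counts them
    return K.sum(x * mask, axis=1) / K.sum(mask, axis=1)

# usage: Lambda(masked_mean)([x, mask])  -> shape [batch_size, dim]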

What if we want the maximum value of $x$? We have $\max([1, 0, 3, 4, 5]) = \max([1, 0, 3, 4, 5, 0, 0, 0]) = 5$. It seems we don't need to eliminate the padding effect? This holds for this specific example, but consider:

$$x = [-1, -2, -3, -4, -5]$$

After padding, it becomes:

$$x = [-1, -2, -3, -4, -5, 0, 0, 0]$$

If you directly take the $\max$ of the padded $x$, the result is 0, whereas the true maximum is $-1$. The solution is to make the padded part so small that it can (almost) never be selected by the $\max$ operation, for example:

$$\max(x) = \max\left(x - (1 - m) \times 10^{10}\right)$$

Normally, the magnitude of a neural network's inputs and outputs isn't extremely large. After $x - (1 - m) \times 10^{10}$, the padding part becomes a very large negative number (on the order of $-10^{10}$), ensuring it won't be picked by the $\max$ operator.
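The same trick, written as a backend sketch under the same shape conventions as the masked mean above:

def masked_max(inputs):
    x, mask = inputs
    # padded positions are pushed down to about -1e10, so max never picks them
    return K.max(x - (1 - mask) * 1e10, axis=1)

# usage: Lambda(masked_max)([x, mask])  -> shape [batch_size, dim]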

Processing softmax with padding follows the same logic. In Attention mechanisms or pointer networks, we may encounter softmax over variable-length vectors. If we apply softmax to the padded vector directly, the padding part will also share the probability mass, resulting in the sum of probabilities for the meaningful parts being less than 1. The solution is identical to the $\max$ case: make the padding part so small that $e^x$ is close to 0, which can then be ignored:

$$\text{softmax}(x) = \text{softmax}\left(x - (1 - m) \times 10^{10}\right)$$
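And the corresponding masked softmax, e.g. for attention scores of shape $[batch\_size, seq\_len, 1]$ (again a sketch under the same conventions):

def masked_softmax(inputs):
    scores, mask = inputs
    # padded positions get a huge negative score, so exp() is ~0 there
    return K.softmax(scores - (1 - mask) * 1e10, axis=1)

# usage: Lambda(masked_softmax)([scores, mask])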

The mask handling for the above operators is somewhat special. For other operations (except bidirectional RNNs), the masking usually just requires outputting:

$$x \otimes m$$

Which simply keeps the padding part as 0.

Keras Implementation Points

Keras has built-in mask functionality, but I do not recommend using it. The built-in masking is not transparent, lacks flexibility, and doesn't support all layers. I strongly recommend readers implement masking themselves.

Several models I've open-sourced recently provide plenty of masking examples. I believe that by reading the source code carefully, you will easily understand how to implement masking. Here are a few key points. Generally, the input for NLP models is a word ID matrix with shape $[batch\_size, seq\_len]$. I reserve ID 0 for padding and ID 1 for UNK tokens; the remaining IDs are assigned to the actual vocabulary arbitrarily. Then, I use a Lambda layer to generate the mask matrix:

# Assuming input_tensor consists of word IDs, with 0 as padding
mask = Lambda(lambda x: K.cast(K.greater(K.expand_dims(x, 2), 0), 'float32'))(input_tensor)

This generates a mask matrix of shape $[batch\_size, seq\_len, 1]$. After the word ID matrix passes through an Embedding layer, its shape becomes $[batch\_size, seq\_len, word\_size]$, so the mask broadcasts cleanly when multiplied with it (or with any later $[batch\_size, seq\_len, \cdot]$ tensor). This is just my personal convention, not the only standard.
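Putting these two pieces together, a typical model front end might look like the sketch below; the vocabulary size and embedding dimension are placeholders:

from keras import backend as K
from keras.layers import Input, Embedding, Lambda

x_in = Input(shape=(None,), dtype='int32')   # [batch_size, seq_len] word IDs
mask = Lambda(lambda x: K.cast(K.greater(K.expand_dims(x, 2), 0), 'float32'))(x_in)
x = Embedding(30000, 128)(x_in)              # [batch_size, seq_len, word_size]
x = Lambda(lambda xm: xm[0] * xm[1])([x, mask])  # zero out the padded positions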

Integration: Bidirectional RNN

Our previous discussion excluded bidirectional RNNs. RNNs are recurrent models, and the backward direction in particular cannot be fixed by masking the output after the fact. A bidirectional RNN runs a forward pass and a backward pass and then concatenates or adds the results. If we run a backward RNN directly on $[1, 0, 3, 4, 5, 0, 0, 0]$, every output step contains information from the padding zeros, because the zeros are processed first in the backward recursion. The padding's influence therefore cannot be eliminated afterwards; it must be eliminated beforehand.

The solution is: when performing the backward RNN, reverse only the meaningful part of the sequence, so $[1, 0, 3, 4, 5, 0, 0, 0]$ becomes $[5, 4, 3, 0, 1, 0, 0, 0]$ (the three trailing zeros stay in place). Then run an ordinary forward RNN over it, and finally reverse the result back, again touching only the non-padded part. This way the padding never participates in the recursion, and the output stays aligned with the forward RNN. TensorFlow provides a ready-made function for exactly this: tf.reverse_sequence().
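As a concrete check of what tf.reverse_sequence() does with the example above (seq_lengths holds each row's real length; on TensorFlow 2.x the argument is called seq_axis instead of seq_dim):

import tensorflow as tf

x = tf.constant([[1, 0, 3, 4, 5, 0, 0, 0]])
seq_lengths = tf.constant([5])
# only the first 5 positions of each row are reversed; the padding stays put
y = tf.reverse_sequence(x, seq_lengths, seq_dim=1)
# y -> [[5, 4, 3, 0, 1, 0, 0, 0]]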

Unfortunately, Keras's own Bidirectional wrapper does not have this functionality. Thus, I have rewritten it for reference:

import copy
import tensorflow as tf


class BiRNN(OurLayer):
    """A bidirectional RNN wrapper that takes an explicit mask so that the
    two directions stay aligned.
    """
    def __init__(self, rnn, merge_mode='concat', **kwargs):
        super(BiRNN, self).__init__(**kwargs)
        # Deep-copy so the two directions get separate weights, and rename
        # the copies to avoid name clashes
        self.forward_layer = copy.deepcopy(rnn)
        self.backward_layer = copy.deepcopy(rnn)
        self.forward_layer.name = 'forward_' + self.forward_layer.name
        self.backward_layer.name = 'backward_' + self.backward_layer.name
        self.merge_mode = merge_mode
    def reverse_sequence(self, x, mask):
        """Reverse only the non-padded part of each sequence.
        mask has shape [batch_size, seq_len, 1].
        """
        seq_len = K.cast(K.sum(mask, 1)[:, 0], 'int32')
        return tf.reverse_sequence(x, seq_len, seq_dim=1)
    def call(self, inputs):
        x, mask = inputs  # the input tensor and its mask
        # Forward pass is a normal RNN
        y_f = self.reuse(self.forward_layer, x)
        # Backward pass: reverse the meaningful part, run a forward RNN,
        # then reverse the result back so it aligns with y_f
        x_b = self.reverse_sequence(x, mask)
        y_b = self.reuse(self.backward_layer, x_b)
        y_b = self.reverse_sequence(y_b, mask)
        # Merge the two directions and re-apply the mask
        # (assumes the wrapped RNN uses return_sequences=True)
        if self.merge_mode == 'concat':
            y = K.concatenate([y_f, y_b])
        else:
            y = y_f + y_b
        return y * mask
    def compute_output_shape(self, input_shape):
        if self.merge_mode == 'concat':
            return input_shape[0][:-1] + (self.forward_layer.units * 2,)
        else:
            return input_shape[0]

Usage is almost the same as the built-in Bidirectional, except you need to pass the mask matrix as well, for example:

y = BiRNN(LSTM(64, return_sequences=True))([x, mask])

Summary

Keras is an extremely friendly and flexible high-level deep learning API wrapper. Do not believe the rumors claiming "Keras is friendly to beginners but lacks flexibility." Keras is friendly to beginners, even friendlier to experts, and best suited for users who need to customize modules frequently.