By 苏剑林 | August 08, 2017
In fact, this is just a memo...
Dropout is an effective measure to prevent overfitting in deep learning. Of course, in terms of its underlying philosophy, dropout is not just limited to deep learning; it can also be used in traditional machine learning methods. It just appears more natural within the neural network framework of deep learning.
What does it do?
How does dropout operate? Generally speaking, for an input tensor $x$, dropout randomly sets some of its elements to zero and then rescales the result. Taking Keras's Dropout(0.6)(x) as an example, it is essentially equivalent to the following numpy operation:
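(A minimal sketch of the equivalent operation; the variable names and masking code below are my own, not copied from Keras's source.)

```python
import numpy as np

rate = 0.6                                   # Keras's Dropout(0.6): the proportion to discard
x = np.random.random((3, 4))                 # a toy input tensor

# keep each element with probability 1 - rate, zero out the rest
mask = (np.random.random(x.shape) >= rate).astype(x.dtype)

# rescale the survivors by 1 / (1 - rate) so the expected value still matches x;
# this is why nothing extra needs to be done at prediction time
x_dropped = x * mask / (1 - rate)
```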
That is to say, 60% of the elements are set to 0, and the remaining 40% are scaled up by a factor of $1/40\% = 2.5$. It is worth noting that the 0.6 in Keras's Dropout(0.6)(x) is the dropout rate (the proportion to discard), whereas the 0.6 in TensorFlow's tf.nn.dropout(x, 0.6) is the keep probability (the proportion to retain), so check the convention of whichever framework you are using (though if the dropout rate is 0.5, you don't really need to worry about it ^_^).
What is its use?
Generally, we understand dropout as a "low-cost ensemble strategy," which is correct. The specific process can be understood as follows.
After the zeroing operation described above, we can consider the zeroed portion to be discarded, so some information is lost. But even with information lost, life must go on (sorry, I mean training must go on), so the model is forced to fit the target using only the remaining information. Since each dropout is random, the model cannot rely on any particular nodes; overall, it is forced to learn from a small subset of the features each time. And because that subset changes every time, every feature ends up contributing to the model's prediction (rather than the model favoring a few features, which is what leads to overfitting).
Finally, dropout is not applied during prediction. This is equivalent to averaging over all of those partial-feature models (finally using all the information at once), so in theory performance improves and overfitting is less severe (because the risk is spread across every feature rather than concentrated in a subset of them).
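As a quick sanity check of this "averaging" intuition, here is a small numpy experiment of my own (not from the original argument): averaging many independently dropped-out and rescaled copies of $x$ recovers $x$ almost exactly, which is roughly what using the full network at prediction time approximates.

```python
import numpy as np

rate = 0.5
x = np.random.random((3, 4))

# average many independently dropped-out (and rescaled) copies of x;
# the average converges to x itself, since E[mask / (1 - rate)] = 1
copies = [x * (np.random.random(x.shape) >= rate) / (1 - rate) for _ in range(10000)]
print(np.abs(np.mean(copies, axis=0) - x).max())   # a small number close to 0
```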
More flexibility
Sometimes we want to constrain how dropout is applied. For example, within the same batch I might want every sample to be dropped out in the same way, rather than each sample dropping a different set of nodes. Or, when using an LSTM for text classification, I might want to drop out whole word vectors: either an entire word vector is dropped or it is kept intact, rather than only some of its elements being zeroed. Another example: when using a CNN to classify RGB images, I might want to drop out whole channels, dropping any one of the R, G, or B channels of an image (much like a color transformation or RGB perturbation), rather than dropping individual pixels (which amounts to adding noise to the image).
To implement these requirements, one needs to use the noise_shape parameter of dropout. This parameter exists in both Keras's Dropout layer and tf.nn.dropout (I only use these two frameworks), and it means the same thing in both. However, it is rarely discussed online, and even when it is, the explanation is often vague. Let's take tf.nn.dropout(x, 0.5, noise_shape) as an example. First, noise_shape is a 1D tensor, which is simply a 1D array (a list or tuple is fine), whose length must equal the number of dimensions of x (i.e., len(x.shape)). Furthermore, each element of noise_shape can only be either 1 or the corresponding element of x.shape. For example, if x.shape = (3, 4, 5), then there are only 8 allowed configurations for noise_shape:
(3,4,5), (1,4,5), (3,1,5), (3,4,1), (1,1,5), (1,4,1), (3,1,1), (1,1,1)
What does each of them mean? It can be understood this way: for whichever axis is set to 1 in noise_shape, the elements along that axis share one dropout decision, i.e., they are kept or dropped together. For example, (3,4,5) is just ordinary, unconstrained dropout. (1,4,5) means every sample in the batch is dropped out in the same way (which can be understood as applying identical noise to every sample). If x.shape = (3,4,5) stands for (number of sentences, number of words per sentence, word-vector dimension), then (3,4,1) means dropping out word vector by word vector (which can be understood as randomly skipping certain words).
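Concretely, requesting the word-vector-wise dropout just described might look like this (using the TF 1.x-era signature quoted above, where 0.5 is the keep probability; the Keras Dropout layer accepts the same noise_shape keyword, but there 0.5 would be the dropout rate):

```python
import tensorflow as tf

x = tf.random_uniform((3, 4, 5))   # (sentences, words per sentence, word-vector dimension)

# keep_prob = 0.5; each word vector is kept or dropped as a whole
y = tf.nn.dropout(x, 0.5, noise_shape=(3, 4, 1))

# Keras equivalent (0.5 here is the dropout rate, not keep_prob):
# y = Dropout(0.5, noise_shape=(3, 4, 1))(x)
```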
For some readers, understanding this via numpy code might be more intuitive:
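(The mask construction below is my own reconstruction of the idea, not either framework's actual implementation.)

```python
import numpy as np

rate = 0.5
x = np.random.random((3, 4, 5))              # (sentences, words, word-vector dim)
noise_shape = (3, 4, 1)                      # drop whole word vectors

# the mask is sampled with shape noise_shape and then broadcast to x.shape,
# so all elements along an axis marked 1 are kept or dropped together
mask = (np.random.random(noise_shape) >= rate).astype(x.dtype)
x_dropped = x * mask / (1 - rate)            # broadcasting replicates the mask along the last axis

print(x_dropped[0, 0])                       # a word vector is either all zeros or fully scaled
```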