By 苏剑林 | Nov 22, 2021
Dropout is a classic approach to preventing overfitting, and most readers are likely familiar with it. Interestingly, Dropout has recently seen something of a "second spring," with new and intriguing variations emerging; SimCSE and R-Drop, for instance, have both sparked heated discussion. In the article "Dropout Twice Again! This Time It Achieved SOTA in Supervised Tasks", we found that the simple R-Drop can even rival adversarial training, which is quite surprising.
Generally, Dropout is added to the output of each layer or to the model parameters—these are the two classic uses of Dropout. However, I recently learned about a novel use from the paper "Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning": adding it to the gradients.
Adding Dropout to gradients? Most readers have probably never heard of this. So, how effective is it? Let's take a closer look.
In short, this paper proposes an approach called "ChildTuning" to improve the performance of pre-trained models during fine-tuning. "Child" refers to a "child network": a sub-network selected from the pre-trained model to be optimized, which mitigates the risk of overfitting that comes with optimizing the entire model. Two sub-network selection methods are proposed: ChildTuning-D and ChildTuning-F.
ChildTuning-D (Task-Dependent) is a task-related selection method that requires the downstream task's training data for calculation. Specifically, suppose the training data is $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, and the model is $p(y|x; \theta)$, where $\theta$ represents all parameters of the model and $\theta_i$ is the $i$-th parameter. We calculate parameter importance using the following form of Fisher Information:
\begin{equation}F_i = \frac{1}{n}\sum_{j=1}^n \left(\frac{\partial \log p(y_j|x_j;\theta)}{\partial\theta_i}\right)^2\end{equation}After obtaining these importance scores, we rank the parameters and keep only the most important proportion $p$ of them (e.g., the top 20% when $p=0.2$). During model updates, only these parameters are optimized. Since fewer parameters are updated, the risk of overfitting is reduced. In practice, ChildTuning-D determines the parameters to be optimized before fine-tuning and keeps them fixed thereafter.
Note that parameter selection here is done element-wise: within a single parameter matrix, only some entries may be selected. So instead of literally picking whole parameter matrices to optimize, the implementation constructs a corresponding 0/1 mask matrix $M$ and masks the gradients: $g \leftarrow g \otimes M / p$, where dividing by $p$ keeps the overall scale of the updates unchanged. The gradients of unselected parameters stay at 0, so those parameters are never updated. Although fewer parameters are updated in principle, this does not save any computation; the authors position it purely as a way to improve fine-tuning performance.
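To make the procedure concrete, here is a minimal, self-contained PyTorch-style sketch of ChildTuning-D, written by me for illustration rather than taken from the authors' released code (details such as whether the ranking is global or per-matrix may differ); a toy linear classifier and random data stand in for the pre-trained model and the downstream task.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(10, 2)        # stand-in for the pre-trained model
x = torch.randn(64, 10)               # toy downstream "training data"
y = torch.randint(0, 2, (64,))
p = 0.2                               # keep the top 20% of parameter entries

# 1. Fisher information: average the squared gradient of log p(y_j|x_j) per entry
fisher = [torch.zeros_like(w) for w in model.parameters()]
for xj, yj in zip(x, y):
    model.zero_grad()
    F.log_softmax(model(xj), dim=-1)[yj].backward()
    for f, w in zip(fisher, model.parameters()):
        f += w.grad ** 2
fisher = [f / len(x) for f in fisher]

# 2. Rank all entries globally and build fixed 0/1 masks M (one per parameter tensor)
flat = torch.cat([f.flatten() for f in fisher])
threshold = flat.kthvalue(int((1 - p) * flat.numel())).values
masks = [(f > threshold).float() for f in fisher]

# 3. During fine-tuning, mask each gradient: g <- g * M / p, then step as usual
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
model.zero_grad()
F.cross_entropy(model(x), y).backward()
for w, m in zip(model.parameters(), masks):
    w.grad.mul_(m).div_(p)            # unselected entries keep a zero gradient
optimizer.step()
```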
ChildTuning-F (Task-Free) is a task-independent selection method, which can be more vividly described as "Gradient Dropout." Unlike ChildTuning-D, which uses task data to construct a fixed 0/1 matrix $M$ and modifies the gradient to $g \otimes M / p$, ChildTuning-F aims to be task-independent by randomly constructing a 0/1 matrix $M$ at each update step, where the proportion of 1s is $p$, and then modifying the gradient to $g \otimes M / p$. This is effectively applying Dropout to the gradients.
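For contrast, here is my own minimal sketch of the ChildTuning-F mask (again PyTorch-style, not the authors' code): a fresh Bernoulli mask is drawn at every step and applied to the gradient before the optimizer runs.

```python
import torch

def gradient_dropout_(params, p=0.2):
    """Apply g <- g * M / p in place, with M ~ Bernoulli(p) redrawn on every call."""
    for w in params:
        if w.grad is not None:
            mask = torch.bernoulli(torch.full_like(w.grad, p))
            w.grad.mul_(mask).div_(p)

# Typical usage: loss.backward(); gradient_dropout_(model.parameters()); optimizer.step()
```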
It is important to note that a zero gradient for a parameter at the current step does not necessarily mean a zero update for that parameter, because we typically use optimizers with momentum, such as SGDM and Adam. For these optimizers, the update is proportional to the momentum, a moving average of historical gradients: $m_t = \beta m_{t-1} + (1-\beta)g_t$. If a parameter's historical gradients were non-zero, its momentum (and therefore its update) generally remains non-zero even when the current gradient is 0.
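A tiny numeric check of this point (plain Python, arbitrary numbers of my own choosing): with momentum $\beta=0.9$, dropping the current gradient to 0 still leaves a non-zero update.

```python
beta, m = 0.9, 0.0
for g in [1.0, 1.0, 0.0]:          # the last "gradient" has been dropped to 0
    m = beta * m + (1 - beta) * g  # momentum: moving average of past gradients
print(m)                           # ~0.171, not 0, so the parameter still moves
```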
This raised a question for me: based on the design philosophy of ChildTuning, it seems intended to select a sub-network for update at each step—meaning only a $p$ proportion of parameters should be updated. However, based on the analysis above, applying Gradient Dropout doesn't actually achieve this. To achieve that goal, one should apply Dropout to the per-step update amount $\Delta\theta$. Yet, I repeatedly checked the original paper and even cross-referenced the open-source code; the authors definitely mean applying Dropout to the gradients.
From the experimental results provided in the paper, ChildTuning's "track record" is quite stellar, with improvements in almost every case, some reaching as high as 8%—

[Image: ChildTuning's "track record" (1)]

[Image: ChildTuning's "track record" (2)]
From the tables, we can see that ChildTuning-D achieved improvements in almost all tasks, while ChildTuning-F was also effective in many tasks. Furthermore, the paper clarifies that the results shown are for the large versions of the models. In private communication with the authors, they mentioned that the base versions also show improvements, though they were omitted for brevity.
ChildTuning-D ranks parameters based on Fisher Information, an approach with a long history; its effectiveness is not surprising. Similar works include "Training Neural Networks with Fixed Sparse Masks". In contrast, the performance of the task-independent ChildTuning-F—Gradient Dropout—is quite interesting and warrants deeper thought.
Coincidentally, there was another paper on Gradient Dropout last year titled "Regularizing Meta-Learning via Gradient Dropout". This suggests that Gradient Dropout indeed has some utility. Why does it work?
The original paper provides an explanation based on SGD, stating that Gradient Dropout can expand the variance during the update process, thereby helping the model escape from sub-optimal local minima.
Specifically, since we use SGD, the gradient calculated at each step has some randomness. Assume it follows a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. For ChildTuning-F, we introduce a random variable $\varepsilon$ that is 1 with probability $p$ and 0 with probability $1-p$. Then we have:
\begin{equation}\begin{aligned}&\mathbb{E}[g\varepsilon/p]=\mathbb{E}[g]\mathbb{E}[\varepsilon]/p=\mu \\ &\mathbb{E}[(g\varepsilon/p)^2]=\mathbb{E}[g^2]\mathbb{E}[\varepsilon^2]/p^2 = (\mu^2+\sigma^2)/p \end{aligned}\end{equation}Therefore:
\begin{equation}\mathbb{V}ar[g\varepsilon/p] = \mathbb{E}[(g\varepsilon/p)^2] - \mathbb{E}[g\varepsilon/p]^2=\sigma^2 + \frac{1-p}{p}(\mu^2+\sigma^2) > \sigma^2\end{equation}In other words, Gradient Dropout maintains the mean of the gradient while expanding the variance. In SGD, the update amount is proportional to the gradient; thus, Gradient Dropout expands the variance of the update amount. The paper argues this helps the model reach better convergence results.
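The formula above is easy to verify numerically; the following quick Monte-Carlo check (my own, with arbitrary values of $\mu$, $\sigma$ and $p$) confirms that the mean is preserved while the variance is inflated as predicted.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, p, n = 0.5, 1.0, 0.2, 1_000_000
g = rng.normal(mu, sigma, n)              # stochastic gradient samples
eps = rng.binomial(1, p, n)               # 1 with probability p, else 0
dropped = g * eps / p

print(dropped.mean())                     # ~0.5  (mean mu is preserved)
print(dropped.var())                      # ~6.0  (much larger than sigma^2 = 1)
print(sigma**2 + (1 - p) / p * (mu**2 + sigma**2))   # 6.0, the predicted variance
```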
This explanation seems reasonable and aligns with many people's intuition, as many subconsciously believe SGD is better than full-batch gradient descent because of the noise. However, upon closer inspection, this explanation is somewhat "misaligned."
The reason is simple: the analysis above applies to SGD, but in NLP, we almost always use Adam (or its variants). Does the conclusion still hold for Adam? Unfortunately, no—it's actually the opposite. In Adam, in the long term, the update amount can be approximated as ($\eta$ is the learning rate):
\begin{equation}\Delta\theta = \eta\frac{\mathbb{E}[g]}{\sqrt{\mathbb{E}[g^2]}}\end{equation}With Gradient Dropout, the update amount becomes:
\begin{equation}\eta\frac{\mathbb{E}[g\varepsilon/p]}{\sqrt{\mathbb{E}[(g\varepsilon/p)^2]}}=\eta\sqrt{p}\frac{\mathbb{E}[g]}{\sqrt{\mathbb{E}[g^2]}}\end{equation}As we can see, in the long run, adding Gradient Dropout to Adam is simply equivalent to scaling the learning rate down by a factor of $\sqrt{p}$! And since the learning rate, and hence the update amount, shrinks, the variance of the update amount shrinks as well. In other words, if you use the Adam optimizer, the actual situation is exactly the opposite of the paper's explanation: the variance of the update amount not only fails to increase, it actually decreases.
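The $\sqrt{p}$ factor can also be checked numerically; the snippet below (my own, using the same toy distribution as before) compares the long-run Adam-style update $\mathbb{E}[g]/\sqrt{\mathbb{E}[g^2]}$ with and without Gradient Dropout.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, p, n = 0.5, 1.0, 0.2, 1_000_000
g = rng.normal(mu, sigma, n)
gd = g * rng.binomial(1, p, n) / p                  # gradient after Dropout

plain = g.mean() / np.sqrt((g ** 2).mean())         # ~ E[g]/sqrt(E[g^2])
dropped = gd.mean() / np.sqrt((gd ** 2).mean())
print(dropped / plain, np.sqrt(p))                  # both ~0.447: the sqrt(p) shrinkage
```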
The root cause of this phenomenon is that when using an optimizer with a moving average, the update amount is no longer proportional to the gradient. Consequently, gradient changes do not necessarily correspond to update amount changes. This brings us back to my earlier question: why not apply Dropout directly to the update amount? If it were Update Dropout, the derivation based on SGD could be applied directly.
However, I believe that even if the optimizer were limited to SGD, or if Dropout were applied directly to the update amount, the paper's derivation still wouldn't fully explain its effectiveness. The reason is simple: there are too many operations that achieve "unchanged mean, expanded variance"—for example, just adding Gaussian noise to the gradient. Is it likely that all such operations achieve the same effect? It seems improbable. I believe the explanation for Gradient Dropout or Update Dropout must focus on the sparsity brought by Dropout.
On this point, I am reminded of an article I wrote previously: "Optimization Algorithms from a Dynamic Perspective (VII): SGD ≈ SVM?". That article tells us that the solution of any model trained with SGD is essentially similar to an SVM model:
\begin{equation}f_{\theta_T}(x) = \beta(x) + \sum_i \alpha_i (x) K(x, x_i)\end{equation}Where $x_i$ is the $i$-th training sample. What are its characteristics? $K(x, x_i)$ acts like a "similarity function." The above form implies that the model effectively "memorizes" the training set and, during prediction, retrieves samples from the training set based on their similarity $K(x, x_i)$ to provide a result. Of course, this is a conceptual explanation; we don't actively design the model this way. However, from this perspective, we see that gradient descent is essentially memorizing samples and providing predictions in a manner similar to KNN. This makes it easy to understand the conclusion that "more training samples usually lead to better results."
Returning to ChildTuning-F: we sample a batch and apply Dropout to the resulting gradients or updates. Combined with the "memorization" explanation, we can intuitively imagine that this is essentially "using only a small portion of the parameters to memorize a small portion of the samples," rather than using all parameters to memorize every batch. Thus, this is similar to the principle of "not putting all your eggs in one basket." By distributing samples more evenly across the parameters, the risk of overfitting is reduced.
For ChildTuning-F, if you know how to modify an optimizer, implementing either Gradient Dropout or Update Dropout is just one line of code. It's worth a try. What if it actually works?
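As a concrete illustration of that "one line", here is a hand-rolled SGD-with-momentum step in PyTorch (my own sketch, not the bert4keras code used for the experiments below): Gradient Dropout masks the gradient before the moving average, while Update Dropout masks the final increment instead.

```python
import torch

def sgdm_step(params, momenta, lr=1e-3, beta=0.9, p=0.2, mode="grad"):
    """One SGD-with-momentum step; `momenta` are zero-initialised tensors shaped like `params`."""
    for w, m in zip(params, momenta):
        g = w.grad
        if mode == "grad":                                    # Gradient Dropout (ChildTuning-F)
            g = g * torch.bernoulli(torch.full_like(g, p)) / p
        m.mul_(beta).add_(g, alpha=1 - beta)                  # momentum update
        update = lr * m
        if mode == "incre":                                   # Update Dropout (on the increment)
            update = update * torch.bernoulli(torch.full_like(update, p)) / p
        w.data.sub_(update)
```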
I conducted tests on several CLUE tasks. The results are shown in the table below. The baseline code comes from "With bert4keras in Hand, I Have the Baseline: CLUE Benchmark Code". "grad drop" refers to Dropout on gradients, and "incre drop" refers to Dropout on the update amount (increments). Green indicates an improvement over the baseline, and red indicates a decline. Due to limited time and computing power, all results are from a single run and are subject to random fluctuation.
$$\begin{array}{c} \text{Comparison Experiment on CLUE Classification Tasks (Validation Set)} \\ {\begin{array}{c|ccccccc} \hline & \text{IFLYTEK} & \text{TNEWS} & \text{AFQMC} & \text{OCNLI} & \text{CMNLI} & \text{WSC} & \text{CSL} \\ \hline \text{BERT} & 60.06 & 56.80 & 72.41 & 73.93 & 79.56 & 78.62 & 83.93 \\ \text{BERT}_{\text{-grad drop}} & \color{green}{60.56} & \color{green}{56.97} & \color{red}{72.13} & \color{green}{74.88} & \color{green}{80.09} & \color{red}{75.99} & \color{red}{83.83} \\ \text{BERT}_{\text{-incre drop}} & \color{red}{59.99} & \color{red}{56.78} & \color{green}{72.66} & \color{green}{74.51} & \color{red}{79.36} & \color{red}{77.30} & \color{green}{84.20} \\ \hline \text{RoBERTa} & 60.64 & 58.06 & 74.05 & 76.00 & 81.24 & 87.50 & 84.50\\ \text{RoBERTa}_{\text{-grad drop}} & \color{green}{60.72} & \color{red}{57.91} & \color{red}{74.03} & \color{red}{75.19} & \color{red}{80.52} & \color{red}{84.54} & \color{green}{84.73}\\ \text{RoBERTa}_{\text{-incre drop}} & \color{green}{60.87} & \color{red}{57.99} & \color{red}{74.03} & \color{red}{75.97} & \color{red}{81.02} & \color{red}{84.87} & \color{green}{84.73}\\ \hline \end{array}} \end{array}$$From the table, we can broadly observe:
1. Gradient Dropout and Update Dropout are generally comparable, each with its own pros and cons.
2. The effect is more noticeable on BERT, while there is almost no effect on RoBERTa. This aligns with the English experimental results provided in the paper.
This result is somewhat frustrating. One cannot say it's ineffective, but normally, who would choose BERT—which has the same speed but worse results—over the better-performing RoBERTa? If RoBERTa doesn't really work with this, it seems there might not be much value in trying it. Of course, the paper showed the largest gains on Electra, which I haven't tested. Interested readers are welcome to try it and let me know the results.
Additionally, I didn't have much interest in ChildTuning-D, and its implementation is slightly more complex, so I didn't test it. Readers who have experimented with it are also welcome to share their findings.
This article introduced the practice of adding Dropout to gradients to improve fine-tuning performance and provided a personal theoretical analysis. Overall, my feeling is: it's worth a try and might be effective, but don't set your expectations too high.