By 苏剑林 | July 31, 2020
When training a model, do we need to drive the loss all the way down to 0? Obviously not. Generally speaking, we train the model on a training set, but what we really want is for the loss on the validation set to be as small as possible. Normally, after the training set loss drops to a certain level, the validation set loss starts to rise. Therefore, there is no need to reduce the training set loss to 0.
That being the case, after a certain threshold has already been reached, can we do something else to improve model performance? The ICML 2020 paper "Do We Need Zero Training Loss After Achieving Zero Training Error?" answers this question. However, the paper's answer is limited only to the level of "what it is" and does not describe "why" very well. Additionally, after reading the interpretation by the expert kid丶 on Zhihu, I still didn't find the answer I wanted. Therefore, I analyzed it myself and recorded it here.
The solution provided by the paper is very simple. Suppose the original loss function is $\mathcal{L}(\theta)$, it is now modified to $\tilde{\mathcal{L}}(\theta)$:
\begin{equation}\tilde{\mathcal{L}}(\theta)=|\mathcal{L}(\theta) - b|+b\end{equation}where $b$ is a pre-set threshold. When $\mathcal{L}(\theta) > b$, we have $\tilde{\mathcal{L}}(\theta)=\mathcal{L}(\theta)$, and ordinary gradient descent is performed. However, when $\mathcal{L}(\theta) < b$, we have $\tilde{\mathcal{L}}(\theta)=2b-\mathcal{L}(\theta)$; the sign in front of $\mathcal{L}(\theta)$ has flipped, so gradient descent on $\tilde{\mathcal{L}}(\theta)$ becomes gradient ascent on $\mathcal{L}(\theta)$. In short, with $b$ as the threshold, once the loss falls below the threshold the objective is instead to increase the loss. The paper calls this modification "Flooding."
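To make this concrete, here is a minimal sketch of what the modification looks like in code (PyTorch and a cross-entropy loss are assumed here, and the flood level $b=0.1$ is just an illustrative value):

```python
import torch.nn.functional as F

def flooding_loss(logits, labels, b=0.1):
    """Flooding: replace the loss L with |L - b| + b (b = 0.1 is an illustrative flood level)."""
    loss = F.cross_entropy(logits, labels)
    # When loss > b this equals the original loss, so the update is ordinary gradient descent;
    # when loss < b the gradient flips sign, i.e. gradient ascent on the original loss.
    return (loss - b).abs() + b
```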
What effect does this have? The paper shows that in certain tasks, after the training set loss function is processed this way, the "Double Descent" phenomenon can appear in the validation set loss, as shown in the figure below:

Left: Training schematic without Flooding; Right: Training schematic with Flooding
Simply put, the final result on the validation set might be better. The experimental results from the original paper are as follows:

Flooding experimental results. The "F" in the third row represents Flooding, and the columns with red checkmarks are those with Flooding added.
How do we explain this method? One can imagine that after the loss function reaches $b$, the training process roughly alternates between executing gradient descent and gradient ascent. Intuitively, it seems that one step of ascent and one step of descent would just cancel each other out. Is this actually the case? Let's do a calculation and see. Suppose we first perform one step of descent followed by one step of ascent, with a learning rate of $\varepsilon$, then:
\begin{equation}\begin{aligned}&\theta_n = \theta_{n-1} - \varepsilon g(\theta_{n-1})\\ &\theta_{n+1} = \theta_n + \varepsilon g(\theta_n) \end{aligned}\end{equation}where $g(\theta)=\nabla_{\theta}\mathcal{L}(\theta)$. Now we have
\begin{equation}\begin{aligned}\theta_{n+1} =&\ \theta_{n-1} - \varepsilon g(\theta_{n-1}) + \varepsilon g\big(\theta_{n-1} - \varepsilon g(\theta_{n-1})\big)\\ \approx&\ \theta_{n-1} - \varepsilon g(\theta_{n-1}) + \varepsilon \big(g(\theta_{n-1}) - \varepsilon \nabla_{\theta} g(\theta_{n-1}) g(\theta_{n-1})\big)\\ =&\ \theta_{n-1} - \frac{\varepsilon^2}{2}\nabla_{\theta}\Vert g(\theta_{n-1})\Vert^2 \end{aligned}\end{equation}The $\approx$ step uses a first-order Taylor expansion of the gradient $g$, and the last line uses the identity $\nabla_{\theta}\Vert g(\theta)\Vert^2 = 2\nabla_{\theta}g(\theta)\,g(\theta)$.
The net result is equivalent to a single gradient-descent step on the gradient penalty $\Vert g(\theta)\Vert^2=\Vert\nabla_{\theta}\mathcal{L}(\theta)\Vert^2$ with learning rate $\frac{\varepsilon^2}{2}$. More interestingly, if we switch to "ascent first, then descent," the result is exactly the same (which reminds me of the story about "raising the price by 10% then lowering it by 10%" versus "lowering the price by 10% then raising it by 10%"). Therefore, on average, the modification that Flooding makes to the loss function is equivalent to minimizing $\Vert\nabla_{\theta}\mathcal{L}(\theta)\Vert^2$ after ensuring the loss is already small enough. This pushes the parameters toward a flatter region, which usually improves generalization (better resistance to perturbations), and thus explains to some extent why Flooding works.
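This equivalence is easy to check numerically. The snippet below is a small sanity check on an illustrative quadratic loss $\mathcal{L}(\theta)=\frac{1}{2}\theta^{\top}A\theta$ (for which the Taylor expansion above is exact), comparing "one descent step then one ascent step" with a single gradient-descent step on $\Vert g(\theta)\Vert^2$ at learning rate $\varepsilon^2/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
A = A @ A.T                      # symmetric A, so L(θ) = ½ θᵀAθ has gradient g(θ) = Aθ
theta = rng.standard_normal(3)
eps = 1e-2

g = lambda t: A @ t              # gradient of the quadratic loss

# One descent step followed by one ascent step, both with learning rate ε
t_down = theta - eps * g(theta)
t_up = t_down + eps * g(t_down)

# A single gradient-descent step on ‖g(θ)‖², using ∇‖g(θ)‖² = 2AᵀAθ for this loss
t_penalty = theta - (eps**2 / 2) * 2 * A.T @ (A @ theta)

print(np.allclose(t_up, t_penalty))  # True: the two updates coincide for a quadratic loss
```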
In essence, this is not fundamentally different from adding random noise to parameters or adversarial training, etc., except that here perturbations are added only after ensuring the loss is small enough. Readers can refer to "Random Thoughts on Generalization: From Random Noise and Gradient Penalty to Virtual Adversarial Training" to understand related content, or refer to the "Regularization" section in Chapter 7 of Part II of the "Bible" Deep Learning.
Readers who intend to use this method may worry about the choice of $b$. However, I have another idea: $b$ simply determines when to start alternating training. What if we use different learning rates for alternating training from the very beginning? That is, always execute
\begin{equation}\begin{aligned}&\theta_n = \theta_{n-1} - \varepsilon_1 g(\theta_{n-1})\\ &\theta_{n+1} = \theta_n + \varepsilon_2 g(\theta_n) \end{aligned}\end{equation}where $\varepsilon_1 > \varepsilon_2$. This way, we remove $b$ (though we introduce the choice of $\varepsilon_1/\varepsilon_2$; there is no free lunch in this world). Repeating the above approximate expansion, we get
\begin{equation}\begin{aligned} \theta_{n+1} \approx&\ \theta_{n-1} - (\varepsilon_1 - \varepsilon_2) g(\theta_{n-1}) - \frac{\varepsilon_1\varepsilon_2}{2}\nabla_{\theta}\Vert g(\theta_{n-1})\Vert^2\\ =&\ \theta_{n-1} - (\varepsilon_1 - \varepsilon_2)\nabla_{\theta}\left[\mathcal{L}(\theta_{n-1}) + \frac{\varepsilon_1\varepsilon_2}{2(\varepsilon_1 - \varepsilon_2)}\Vert \nabla_{\theta}\mathcal{L}(\theta_{n-1})\Vert^2\right] \end{aligned}\end{equation}This is equivalent to optimizing the loss function $\mathcal{L}(\theta) + \frac{\varepsilon_1\varepsilon_2}{2(\varepsilon_1 - \varepsilon_2)}\Vert\nabla_{\theta}\mathcal{L}(\theta)\Vert^2$ with a learning rate of $\varepsilon_1 - \varepsilon_2$ from start to finish. In other words, the gradient penalty is added from the beginning. Can this improve the model's generalization performance? I tried it briefly; in some cases, there is a slight improvement, and generally, there are no negative effects. Overall, it is not as good as directly adding the gradient penalty yourself, so I do not recommend doing it this way.
Note: Reader @xx205 provided the reference "Backstitch: Counteracting Finite-sample Bias via Negative Steps", which points out that this approach is effective in speech recognition. Therefore, my statement above may not be entirely accurate; readers are encouraged to test it for themselves. Thanks again to reader @xx205 for the reference.
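For readers who do want to try the alternating scheme, here is a minimal sketch of one such update (PyTorch is assumed; `params` is a list of leaf tensors with `requires_grad=True`, `loss_fn` is a closure that recomputes the loss on the current batch, and the learning rates are illustrative values):

```python
import torch

def alternating_step(params, loss_fn, eps1=1e-2, eps2=5e-3):
    """One descent step with learning rate eps1, then one ascent step with eps2 (eps1 > eps2)."""
    # Descent step: θ_n = θ_{n-1} - ε₁ g(θ_{n-1})
    grads = torch.autograd.grad(loss_fn(), params)
    with torch.no_grad():
        for p, grad in zip(params, grads):
            p -= eps1 * grad
    # Ascent step on the updated parameters: θ_{n+1} = θ_n + ε₂ g(θ_n)
    grads = torch.autograd.grad(loss_fn(), params)
    with torch.no_grad():
        for p, grad in zip(params, grads):
            p += eps2 * grad
```

By the analysis above, this behaves like plain gradient descent on $\mathcal{L}(\theta) + \frac{\varepsilon_1\varepsilon_2}{2(\varepsilon_1 - \varepsilon_2)}\Vert\nabla_{\theta}\mathcal{L}(\theta)\Vert^2$ with learning rate $\varepsilon_1 - \varepsilon_2$; as noted, adding the gradient penalty term directly is usually the simpler choice.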
This article briefly introduced a training strategy proposed in an ICML 2020 paper—"gradient ascent after reaching a certain point"—and provided my own derivation and understanding. The results show that it is equivalent to a gradient penalty on the parameters, and gradient penalty is one of the common regularization methods.