L2 Regularization is Not as Good as Imagined? It Might Be the Fault of "Weight Scale Shifting"

By Su Jianlin | August 14, 2020

L2 regularization is a commonly used method in machine learning to prevent overfitting (and is likely a frequent interview question). Simply put, it aims to keep the magnitude of weights as small as possible so that the model can withstand more perturbations, ultimately improving its generalization performance. However, readers may also find that the performance of L2 regularization is often not as good as theory suggests; in many cases, adding it might even have a negative effect. A recent paper, "Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations", analyzes the drawbacks of L2 regularization from the perspective of "Weight Scale Shifting" and proposes a new WEISSI regularization term. The entire analysis process is quite interesting, and I would like to share it with you here.

Related Content

In this section, we first briefly review L2 regularization, then introduce its connection to weight decay and the related AdamW optimizer.

Understanding L2 Regularization

Why add L2 regularization? There are multiple answers to this question. Some answer from the perspective of Ridge regression, others from Bayesian inference. Here, we provide an understanding from the perspective of perturbation sensitivity.

For two (column) vectors $\boldsymbol{w}, \boldsymbol{x}$, we have the Cauchy-Schwarz inequality $|\boldsymbol{w}^{\top}\boldsymbol{x}| \leq \Vert\boldsymbol{w}\Vert_2\Vert\boldsymbol{x}\Vert_2$. Based on this result, we can prove:

\begin{equation}\Vert\boldsymbol{W}(\boldsymbol{x}_2 - \boldsymbol{x}_1)\Vert_2\leq \Vert\boldsymbol{W}\Vert_2\Vert\boldsymbol{x}_2 - \boldsymbol{x}_1\Vert_2\end{equation}

where $\Vert\boldsymbol{W}\Vert_2^2$ is the sum of the squares of all elements of the matrix $\boldsymbol{W}$ (i.e., $\Vert\boldsymbol{W}\Vert_2$ is the Frobenius norm). The proof is not difficult, and interested readers can complete it themselves. This result tells us that the change in $\boldsymbol{W}\boldsymbol{x}$ is controlled by $\Vert\boldsymbol{W}\Vert_2$ and $\Vert\boldsymbol{x}_2 - \boldsymbol{x}_1\Vert_2$. Therefore, if we want the change in $\boldsymbol{W}\boldsymbol{x}$ to be as small as possible whenever $\Vert\boldsymbol{x}_2 - \boldsymbol{x}_1\Vert_2$ is small, we can reduce $\Vert\boldsymbol{W}\Vert_2$. To that end, we can add a regularization term $\mathcal{L}_{reg}=\Vert\boldsymbol{W}\Vert_2^2$ to the task objective $\mathcal{L}_{task}$, which is essentially L2 regularization. Further discussion from this perspective can be found in "Lipschitz Constraints in Deep Learning: Generalization and Generative Models" (though note that the notation in the two articles differs slightly).
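
To make this concrete, here is a minimal numerical check of the inequality (a sketch using numpy; the matrix norm here is the Frobenius norm, as defined above):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
x1, x2 = rng.normal(size=16), rng.normal(size=16)

lhs = np.linalg.norm(W @ (x2 - x1))                 # ||W(x2 - x1)||_2
rhs = np.linalg.norm(W) * np.linalg.norm(x2 - x1)   # ||W||_2 ||x2 - x1||_2 (Frobenius)
print(lhs <= rhs)  # True: the output perturbation is bounded via ||W||_2
```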

AdamW Optimizer

When using SGD for optimization, assuming the original iteration is $\boldsymbol{\theta}_{t}=\boldsymbol{\theta}_{t-1} - \varepsilon\boldsymbol{g}_{t}$, it is easy to prove that after adding the L2 regularization term $\frac{\lambda}{2}\Vert\boldsymbol{\theta}\Vert_2^2$, the iteration becomes:

\begin{equation}\boldsymbol{\theta}_{t}=(1-\varepsilon\lambda)\boldsymbol{\theta}_{t-1} - \varepsilon\boldsymbol{g}_{t}\end{equation}

Since $0 < 1-\varepsilon\lambda < 1$ (assuming $\varepsilon\lambda$ is small), the parameters $\boldsymbol{\theta}$ acquire a tendency to "shrink" toward 0 during optimization. This modification is called "Weight Decay."
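
This equivalence is easy to verify numerically; below is a minimal one-step check (a sketch with made-up values for $\varepsilon$, $\lambda$, $\boldsymbol{\theta}$, and the gradient):

```python
import numpy as np

eps, lam = 0.1, 0.01
theta = np.array([1.0, -2.0, 3.0])
g = np.array([0.5, 0.5, -0.5])  # gradient of L_task alone

step_l2 = theta - eps * (g + lam * theta)       # SGD on L_task + (lam/2)||theta||^2
step_decay = (1 - eps * lam) * theta - eps * g  # "weight decay" form
print(np.allclose(step_l2, step_decay))  # True
```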

However, the equivalence between L2 regularization and weight decay holds only under the SGD optimizer. With an adaptive learning rate optimizer such as Adagrad or Adam, the two are not equivalent: there, the effect of L2 regularization is roughly that of adding $-\varepsilon\lambda\,\text{sign}(\boldsymbol{\theta}_{t-1})$ to the update rather than $-\varepsilon\lambda\boldsymbol{\theta}_{t-1}$. In other words, every element receives a uniform penalty, instead of elements with larger absolute values receiving a larger penalty, which partially cancels out the intended effect of L2 regularization. The paper "Decoupled Weight Decay Regularization" was the first to emphasize this issue, and it proposed the improved AdamW optimizer.
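
To see why the two decouple, here is a deliberately simplified single update step (roughly the first Adam step, where the adaptive denominator reduces to $|\boldsymbol{g}|$; an illustration, not the full Adam algorithm):

```python
import numpy as np

eps, lam = 0.001, 0.01
theta = np.array([10.0, 0.1])  # one large, one small weight
g = np.array([0.2, 0.2])       # identical task gradients

# L2 regularization: lam*theta enters the gradient and is then divided
# by the adaptive denominator, so it degenerates to roughly sign(theta),
# penalizing the large and the small weight almost equally.
g_l2 = g + lam * theta
update_l2 = -eps * g_l2 / (np.abs(g_l2) + 1e-8)

# Decoupled weight decay (AdamW): the decay term bypasses the adaptive
# normalization, so the larger weight is decayed proportionally more.
update_decoupled = -eps * g / (np.abs(g) + 1e-8) - eps * lam * theta

print(update_l2)         # ~[-0.001, -0.001]: penalty mostly cancelled
print(update_decoupled)  # [-0.0011, -0.001001]: scale-aware decay survives
```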

New Regularization

In this section, we will point out that the "weight scale shifting" phenomenon exists in most common deep learning models, and that it can make L2 regularization much less effective than intended. We then construct a new regularization term that has a similar effect to L2 but is compatible with weight scale shifting, and which should therefore work better in theory.

Weight Scale Shifting

We know that the basic building block of deep learning models is "linear transformation + nonlinear activation function," and one of the most commonly used activation functions today is $\text{relu}(x)=\max(x,0)$. Interestingly, both operations satisfy "positive homogeneity": for any $\varepsilon \geq 0$, the identity $\varepsilon\phi(x)=\phi(\varepsilon x)$ holds. Other activation functions such as SoftPlus, GELU, and Swish are smooth approximations of $\text{relu}$, so they can be regarded as approximately positively homogeneous.
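
Positive homogeneity of $\text{relu}$ is easy to verify numerically (a minimal check):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0)
x = np.array([-1.5, 0.0, 2.0])
for eps in (0.0, 0.5, 3.0):
    assert np.allclose(eps * relu(x), relu(eps * x))
print("positive homogeneity holds for relu")
```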

"Positive homogeneity" makes deep learning models invariant to a certain degree of weight scale shifting. Specifically, assume an $L$-layer model:

\begin{equation} \begin{aligned} \boldsymbol{h}_L =& \phi(\boldsymbol{W}_L \boldsymbol{h}_{L-1} + \boldsymbol{b}_L) \\ =& \phi(\boldsymbol{W}_L \phi(\boldsymbol{W}_{L-1} \boldsymbol{h}_{L-2} + \boldsymbol{b}_{L-1}) + \boldsymbol{b}_L) \\ =& \dots \\ =& \phi(\boldsymbol{W}_L \phi(\boldsymbol{W}_{L-1} \phi(\dots\phi(\boldsymbol{W}_1\boldsymbol{x} + \boldsymbol{b}_1)\dots) + \boldsymbol{b}_{L-1}) + \boldsymbol{b}_L) \end{aligned} \end{equation}

Suppose each parameter undergoes a scale shift $\boldsymbol{W}_l = \gamma_l\tilde{\boldsymbol{W}}_l$, $\boldsymbol{b}_l = \left(\prod_{k=1}^l \gamma_k\right)\tilde{\boldsymbol{b}}_l$ with $\gamma_l > 0$ (each bias must absorb the cumulative scale of all layers up to it, otherwise the factorization below does not go through). Then, by positive homogeneity, we have:

\begin{equation} \begin{aligned} \boldsymbol{h}_L =& \left(\prod_{l=1}^L \gamma_l\right)\phi(\tilde{\boldsymbol{W}}_L \tilde{\boldsymbol{h}}_{L-1} + \tilde{\boldsymbol{b}}_L) \\ =& \dots \\ =& \left(\prod_{l=1}^L \gamma_l\right) \phi(\tilde{\boldsymbol{W}}_L \phi(\tilde{\boldsymbol{W}}_{L-1} \phi(\dots\phi(\tilde{\boldsymbol{W}}_1\boldsymbol{x} + \tilde{\boldsymbol{b}}_1)\dots) + \tilde{\boldsymbol{b}}_{L-1}) + \tilde{\boldsymbol{b}}_L) \end{aligned} \end{equation}

where $\tilde{\boldsymbol{h}}_{L-1}$ denotes the hidden state computed with the shifted parameters $\{\tilde{\boldsymbol{W}}_l, \tilde{\boldsymbol{b}}_l\}$.

If $\prod_{l=1}^L \gamma_l = 1$, then the model with parameters $\{\boldsymbol{W}_l, \boldsymbol{b}_l\}$ is completely equivalent to the model with parameters $\{\tilde{\boldsymbol{W}}_l, \tilde{\boldsymbol{b}}_l\}$. In other words, the model is invariant to weight scale shifts where $\prod_{l=1}^L \gamma_l = 1$ (WEIght-Scale-Shift-Invariance, WEISSI).
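A minimal numerical check of this invariance, assuming a hypothetical 3-layer relu network (note how the biases are scaled by the cumulative products):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0)
Ws = [rng.normal(size=(8, 8)) for _ in range(3)]
bs = [rng.normal(size=8) for _ in range(3)]
x = rng.normal(size=8)

def forward(Ws, bs, x):
    h = x
    for W, b in zip(Ws, bs):
        h = relu(W @ h + b)
    return h

gammas = np.array([2.0, 0.25, 2.0])        # product is exactly 1
cum = np.cumprod(gammas)                   # cumulative products gamma_1*...*gamma_l
Ws_t = [W / g for W, g in zip(Ws, gammas)] # W~_l = W_l / gamma_l
bs_t = [b / c for b, c in zip(bs, cum)]    # b~_l = b_l / (gamma_1*...*gamma_l)
print(np.allclose(forward(Ws, bs, x), forward(Ws_t, bs_t, x)))  # True
```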

Incompatibility with L2 Regularization

We just saw that as long as the scale shift satisfies $\prod_{l=1}^L \gamma_l = 1$, the models corresponding to the two sets of parameters are equivalent. The problem is that their corresponding L2 regularization terms are not equal:

\begin{equation}\sum_{l=1}^L \Vert\boldsymbol{W}_l\Vert_2^2=\sum_{l=1}^L \gamma_l^2\Vert\tilde{\boldsymbol{W}}_l\Vert_2^2\neq \sum_{l=1}^L \Vert\tilde{\boldsymbol{W}}_l\Vert_2^2\end{equation}

And it can be proven that if we fix $\Vert\boldsymbol{W}_1\Vert_2, \Vert\boldsymbol{W}_2\Vert_2, \dots, \Vert\boldsymbol{W}_L\Vert_2$ and maintain the constraint $\prod_{l=1}^L \gamma_l = 1$, then the minimum of $\sum_{l=1}^L \Vert\tilde{\boldsymbol{W}}_l\Vert_2^2$ is attained at (since the constraint fixes the product $\prod_{l=1}^L \Vert\tilde{\boldsymbol{W}}_l\Vert_2 = \prod_{l=1}^L \Vert\boldsymbol{W}_l\Vert_2$, this follows from the AM-GM inequality):

\begin{equation}\Vert\tilde{\boldsymbol{W}}_1\Vert_2=\Vert\tilde{\boldsymbol{W}}_2\Vert_2=\dots=\Vert\tilde{\boldsymbol{W}}_L\Vert_2=\left(\prod_{l=1}^L \Vert\boldsymbol{W}_l\Vert_2\right)^{1/L}\end{equation}
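
A toy example with hypothetical norms makes the gap concrete: the product of the norms is preserved by any such scale shift, yet the sum of squares can drop substantially once the norms are equalized:

```python
import numpy as np

norms = np.array([1.0, 4.0, 2.0])            # ||W_1||, ||W_2||, ||W_3|| (made up)
geo_mean = norms.prod() ** (1 / len(norms))  # geometric mean = 2.0
print(np.sum(norms**2))                      # 21.0: L2 penalty of the original weights
print(len(norms) * geo_mean**2)              # 12.0: penalty after equalizing the norms
```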

In fact, this reflects the inefficiency of L2 regularization. Imagine we have already trained a set of parameters $\{\boldsymbol{W}_l, \boldsymbol{b}_l\}$ whose generalization performance may not be very good, and we hope L2 regularization will help the optimizer find a better set of parameters (sacrificing a little $\mathcal{L}_{task}$ to reduce $\mathcal{L}_{reg}$). The result above tells us that, because of weight scale shift invariance, the optimizer can simply find a new set of parameters $\{\tilde{\boldsymbol{W}}_l, \tilde{\boldsymbol{b}}_l\}$ that is completely equivalent to the original model (no improvement in generalization) yet has a smaller L2 penalty (the regularizer "took effect"). To put it bluntly, L2 regularization did do its job, but without improving the model's generalization performance, defeating the original purpose of using it.

WEISSI Regularization

The root of the above problem is that the model is invariant to weight scale shifting, but L2 regularization is not. If we can find a new regularization term that has a similar effect yet is itself invariant to weight scale shifting, the problem is solved. Personally, I feel the original paper's explanation of this part is not clear enough, so the following derivation is based mainly on my own understanding.

We consider a regularization term of the following general form:

\begin{equation}\mathcal{L}_{reg}=\sum_{l=1}^L \varphi(\Vert\boldsymbol{W}_l\Vert_2)\end{equation}

For L2 regularization, $\varphi(x)=x^2$. As long as $\varphi(x)$ is monotonically increasing on $[0, +\infty)$, optimizing this term tends to reduce $\Vert\boldsymbol{W}_l\Vert_2$. Now we want the regularization term to be invariant to weight scale shifting; we do not actually need $\varphi(\gamma x) = \varphi(x)$ itself, but only:

\begin{equation}\frac{d}{dx}\varphi(\gamma x)=\frac{d}{dx}\varphi(x) \label{eq:varphi}\end{equation}

because the optimization process only uses its gradient. Some readers may immediately see a solution: the logarithmic function $\varphi(x) = \log x$, for which indeed $\frac{d}{dx}\log(\gamma x)=\frac{1}{x}=\frac{d}{dx}\log x$. So the newly proposed regularization term is:

\begin{equation}\mathcal{L}_{reg}=\sum_{l=1}^L \log\Vert\boldsymbol{W}_l\Vert_2=\log \left(\prod_{l=1}^L \Vert\boldsymbol{W}_l\Vert_2\right)\end{equation}
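
A minimal finite-difference check of the gradient condition, contrasting $\log x$ with $x^2$ (the values of $x$ and $\gamma$ are arbitrary):

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x, gamma = 3.0, 5.0
# log: both derivatives are 1/x = 0.333..., independent of gamma
print(num_grad(np.log, x), num_grad(lambda t: np.log(gamma * t), x))
# x^2: the derivative is inflated by gamma^2 (6.0 vs 150.0)
print(num_grad(np.square, x), num_grad(lambda t: np.square(gamma * t), x))
```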

In addition, perhaps concerned that the penalty above alone is not strong enough, the original paper also adds an L1 penalty on the direction vectors of the parameters. The total regularizer is:

\begin{equation}\mathcal{L}_{reg}=\lambda_1\sum_{l=1}^L \log\Vert\boldsymbol{W}_l\Vert_2 + \lambda_2\sum_{l=1}^L \big\Vert\boldsymbol{W}_l\big/\Vert\boldsymbol{W}_l\Vert_2\big\Vert_1\end{equation}
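
Below is a minimal PyTorch sketch of this combined regularizer, based on my own reading of the formula rather than the authors' reference code; `weights` is assumed to be the list of weight matrices $\boldsymbol{W}_1,\dots,\boldsymbol{W}_L$, and `lambda1`, `lambda2` are hypothetical default values:

```python
import torch

def weissi_reg(weights, lambda1=1e-4, lambda2=1e-4):
    """WEISSI-style regularizer: lambda1 * sum_l log||W_l||_2
    + lambda2 * sum_l ||W_l / ||W_l||_2||_1 (my reading of the formula)."""
    reg = 0.0
    for W in weights:
        norm = torch.norm(W)                       # Frobenius norm ||W_l||_2
        reg = reg + lambda1 * torch.log(norm)      # scale-shift-invariant part
        reg = reg + lambda2 * torch.sum(torch.abs(W / norm))  # L1 on the direction
    return reg

# Hypothetical usage: add it to the task loss over all Linear weights.
# loss = task_loss + weissi_reg([m.weight for m in model.modules()
#                                if isinstance(m, torch.nn.Linear)])
```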

Brief Description of Experimental Results

As per convention, here are the experimental results from the original paper. Naturally, since the authors organized them into a paper, the results shown are positive:

[Figure: one of the experimental results of WEISSI regularization in the original paper]

For us, the takeaway is simply that there is one more option to try when doing "alchemy" (i.e., training models). After all, no theory guarantees that a regularization term will work; you only find out by trying it. However beautifully someone else describes it, it may not be useful for you.

Article Summary

This article introduces the phenomenon of weight scale shift invariance in neural network models and points out its incompatibility with L2 regularization, subsequently proposing a regularization term that acts similarly but resolves this incompatibility.