Lipschitz Constraint in Deep Learning: Generalization and Generative Models

By 苏剑林 | October 07, 2018

Preface: Last year, I wrote an introductory article on WGAN-GP titled "The Art of Mutual Grumbling: From Zero to WGAN-GP", mentioning that a Lipschitz constraint (hereafter the "L-constraint") is imposed on the WGAN discriminator through a gradient penalty. A few days ago I found myself thinking about WGAN again and felt that the gradient penalty in WGAN-GP isn't elegant enough. I had also heard that WGAN has difficulties with conditional generation (because random interpolation between different classes can become messy...). So I wanted to explore whether a new scheme could be devised for adding an L-constraint to the discriminator.

After a few days of working away on my own, I discovered that what I had in mind had already been done by others; it seems there is truly nothing you can imagine that hasn't already been tried. It is mainly covered in these two papers: "Spectral Norm Regularization for Improving the Generalizability of Deep Learning" and "Spectral Normalization for Generative Adversarial Networks".

Therefore, this article will briefly introduce the content related to the L-constraint based on my own understanding. Note that the theme of this article is the L-constraint, not just WGAN. It can be used in generative models as well as in general supervised learning.

L-Constraint and Generalization

Perturbation Sensitivity

Let the input be $x$, the output be $y$, the model be $f$, and the model parameters be $w$, denoted as: \begin{equation}y = f_w(x)\end{equation} Often, we want a "robust" model. What does robust mean? Generally, it has two aspects. First, stability with respect to parameter perturbations: for example, does the model still produce similar results after becoming $f_{w+\Delta w}(x)$? In a dynamical-systems view, we also ask whether the model can eventually return to $f_w(x)$. Second, stability with respect to input perturbations: for example, when the input changes from $x$ to $x+\Delta x$, does $f_w(x+\Delta x)$ still give similar predictions? Readers may have heard that deep learning models are susceptible to "adversarial attacks", where changing just one pixel in an image can lead to a completely different classification result; this is a case of the model being too sensitive to input perturbations.

L-Constraint

Therefore, most of the time we hope the model is insensitive to input perturbations, which usually improves the model's generalization performance. That is, we hope that when $\Vert x_1 - x_2 \Vert$ is very small, \begin{equation}\Vert f_w(x_1) - f_w(x_2)\Vert\end{equation} is also as small as possible. Of course, what "as small as possible" means exactly is hard to define. Thus, Lipschitz proposed a more specific constraint: there exists a constant $C$ (which depends only on the parameters and not on the inputs) such that the following inequality always holds: \begin{equation}\Vert f_w(x_1) - f_w(x_2)\Vert \leq C(w)\cdot \Vert x_1 - x_2 \Vert\label{eq:l-cond}\end{equation} In other words, we want the entire model to be "controlled" by a linear function. This is the L-constraint.

In other words, here we consider a model that satisfies the L-constraint to be a good model. Furthermore, for a specific model, we hope to estimate the expression for $C(w)$ and hope that $C(w)$ is as small as possible. A smaller $C$ means it is less sensitive to input perturbations and has better generalization.

Neural Networks

Here we analyze specific neural networks to observe when a neural network satisfies the L-constraint.

For simplicity, consider a single-layer fully connected network $f(Wx+b)$, where $f$ is the activation function and $W, b$ are the weight matrix and bias vector. Equation $\eqref{eq:l-cond}$ becomes: \begin{equation}\Vert f(Wx_1+b) - f(Wx_2+b)\Vert \leq C(W,b)\cdot \Vert x_1 - x_2 \Vert\end{equation} If we let $x_1, x_2$ be sufficiently close, we can approximate the left side by its first-order term to get: \begin{equation}\left\Vert \frac{\partial f}{\partial y}W(x_1 - x_2)\right\Vert \leq C(W,b)\cdot \Vert x_1 - x_2 \Vert\end{equation} where $y=Wx_2 + b$. Clearly, for the left side to stay below the right side, the absolute value of each element of $\partial f / \partial y$ must be bounded by some constant. This requires activation functions with "bounded derivatives", and fortunately our commonly used activation functions (sigmoid, tanh, ReLU, etc.) all satisfy this condition. Since the derivative of the activation function is bounded (for the commonly used ReLU the bound is exactly 1), the $\partial f / \partial y$ term contributes at most a constant factor, which we can temporarily ignore; we only need to consider $\Vert W(x_1 - x_2)\Vert$.

Multi-layer neural networks can be analyzed recursively layer by layer, ultimately reducing to the single-layer problem. Structures like CNNs and RNNs are essentially special cases of fully connected layers, so the results for fully connected layers apply. Therefore, for neural networks, the problem becomes: if \begin{equation}\Vert W(x_1 - x_2)\Vert \leq C\Vert x_1 - x_2 \Vert\label{eq:l-cond-nn}\end{equation} holds identically, what can the value of $C$ be? Once we find the expression for $C$, we can aim to make $C$ as small as possible, for instance by introducing $C^2$ as a regularization term on the parameters.
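To make the recursion concrete, take a two-layer network $f(W_2\,f(W_1 x + b_1) + b_2)$ with activations whose derivatives are bounded by 1 (a worked step added here for clarity). Applying the single-layer bound twice gives: \begin{equation}\Vert f_w(x_1) - f_w(x_2)\Vert \leq C(W_2)\cdot C(W_1)\cdot \Vert x_1 - x_2 \Vert\end{equation} In other words, the Lipschitz constant of the whole network is bounded by the product of the per-layer constants, so it suffices to control each layer's $C$ separately.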

Matrix Norms

Definition

By now, we have translated the problem into a matrix norm problem (a matrix norm plays a role for matrices similar to what length does for vectors). It is defined as: \begin{equation}\Vert W\Vert_2 = \max_{x\neq 0}\frac{\Vert Wx\Vert}{\Vert x\Vert}\label{eq:m-norm}\end{equation} This norm is known as the "spectral norm"; for normal square matrices (symmetric matrices, for example) it coincides with the spectral radius, and in this article we will call it the spectral norm whether or not $W$ is square. Note that $\Vert Wx\Vert$ and $\Vert x\Vert$ are both vector norms, i.e., ordinary vector magnitudes. The matrix norm on the left has no prior meaning of its own; it is defined through the maximum of the ratio of vector norms on the right. For this reason, this type of matrix norm is called a "matrix norm induced by a vector norm".

Alright, no more technical jargon. With the matrix norm defined, we immediately have: \begin{equation}\Vert W(x_1 - x_2)\Vert \leq \Vert W\Vert_2 \cdot \Vert x_1 - x_2 \Vert\end{equation} Admittedly, we haven't actually done much beyond changing the notation; we still don't know what $\Vert W\Vert_2$ equals.

Frobenius Norm

In fact, the precise concept and calculation method of the spectral norm $\Vert W\Vert_2$ requires several concepts from linear algebra. Let's set it aside for a moment and study a simpler norm: the Frobenius norm, or F-norm for short.

This name might sound intimidating, but its definition is extremely simple: \begin{equation}\Vert W\Vert_F = \sqrt{\sum_{i,j}w_{ij}^2}\end{equation} To put it simply, it treats the matrix as a vector and calculates the Euclidean length of that vector.

Using the Cauchy-Schwarz inequality, we can easily prove: \begin{equation}\Vert Wx\Vert \leq \Vert W\Vert_F \cdot \Vert x \Vert\end{equation} Clearly, $\Vert W\Vert_F$ provides an upper bound for $\Vert W\Vert_2$. That is, $\Vert W\Vert_2$ is the most accurate $C$ in equation $\eqref{eq:l-cond-nn}$ (the smallest $C$ among all that satisfy equation $\eqref{eq:l-cond-nn}$), but if you don't care much about precision, you can directly take $C=\Vert W\Vert_F$, which also makes equation $\eqref{eq:l-cond-nn}$ hold and is easy to calculate.
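As a quick numerical sanity check (my own sketch, not from the original derivation), both inequalities can be verified with numpy's built-in norms:

import numpy as np

W = np.random.randn(64, 32)
x = np.random.randn(32)

spec = np.linalg.norm(W, 2)      # spectral norm: largest singular value of W
frob = np.linalg.norm(W, 'fro')  # Frobenius norm

assert np.linalg.norm(W @ x) <= spec * np.linalg.norm(x) + 1e-8
assert spec <= frob + 1e-8       # ||W||_2 <= ||W||_F always holds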

L2 Regularization

As mentioned earlier, to make the neural network satisfy the L-constraint as well as possible, we want $C=\Vert W\Vert_2$ to be as small as possible, so we can add $C^2$ as a regularization term to the loss function. We haven't yet computed the spectral norm $\Vert W\Vert_2$, but we have found an easy-to-compute (if looser) upper bound $\Vert W\Vert_F$, so let's use that for now. The loss becomes: \begin{equation}loss = loss(y, f_w(x)) + \lambda \Vert W\Vert_F^2\label{eq:l2-regular}\end{equation} where the first part is the original model loss. Looking back at the expression for $\Vert W\Vert_F$, we see that the added regularization term is simply: \begin{equation}\lambda\left(\sum_{i,j}w_{ij}^2\right)\end{equation} Isn't this just L2 regularization?
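In Keras, this term is exactly the built-in L2 weight regularizer; a minimal sketch (the λ=0.01 value is an arbitrary choice of mine):

from keras import regularizers
from keras.layers import Dense

# adds 0.01 * sum_{i,j} w_ij^2 to the loss, i.e. the λ||W||_F^2 term
layer = Dense(128, activation='relu',
              kernel_regularizer=regularizers.l2(0.01))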

Finally, after all that effort, we've gained something: we revealed the connection between L2 regularization (also known as weight decay) and the L-constraint. It shows that L2 regularization helps the model satisfy the L-constraint, thereby reducing the model's sensitivity to input perturbations and enhancing its generalization performance.

Spectral Norm

Principal Eigenvalue

In this section, we officially confront the spectral norm $\Vert W\Vert_2$. This is part of linear algebra and is quite theoretical.

In fact, the spectral norm $\Vert W\Vert_2$ equals the square root of the largest eigenvalue (the principal eigenvalue) of $W^{\top}W$, i.e., the largest singular value of $W$. If $W$ is a normal square matrix (a real symmetric matrix, for instance), then $\Vert W\Vert_2$ equals the largest absolute value of the eigenvalues of $W$.

Note: For readers interested in the theoretical proof, here is a rough outline. According to the definition $\eqref{eq:m-norm}$, we have: $$\Vert W\Vert_2^2 = \max_{x\neq 0}\frac{x^{\top}W^{\top} Wx}{x^{\top} x} = \max_{\Vert x\Vert=1}x^{\top}W^{\top} Wx$$ Since $W^{\top} W$ is real symmetric (and positive semi-definite), it can be diagonalized as $W^{\top} W=U^{\top}\text{diag}(\lambda_1,\dots,\lambda_n)U$, where the $\lambda_i$ are its eigenvalues (all non-negative) and $U$ is an orthogonal matrix. Since the product of an orthogonal matrix and a unit vector is still a unit vector, we have: $$\begin{aligned}\Vert W\Vert_2^2 =& \max_{\Vert x\Vert=1}x^{\top}\text{diag}(\lambda_1,\dots,\lambda_n) x \\ =& \max_{\Vert x\Vert=1} \lambda_1 x_1^2 + \dots + \lambda_n x_n^2\\ \leq & \max\{\lambda_1,\dots,\lambda_n\} (x_1^2 + \dots + x_n^2)\quad(\text{note that }\Vert x\Vert=1)\\ =&\max\{\lambda_1,\dots,\lambda_n\}\end{aligned}$$ The bound is attained by placing all of $x$'s weight on the coordinate of the largest $\lambda_i$, so $\Vert W\Vert_2^2$ is exactly the largest eigenvalue of $W^{\top} W$.

Power Iteration

Some readers might be getting impatient: "Who cares if it's an eigenvalue? I care about how to calculate this damn norm!!"

In fact, the preceding content, abstract as it may seem, is precisely the basis for calculating $\Vert W\Vert_2$. The previous section told us that $\Vert W\Vert_2^2$ is the largest eigenvalue of $W^{\top}W$, so the problem becomes finding the largest eigenvalue of $W^{\top}W$, which can be solved by "power iteration".

The so-called "Power Iteration" refers to the following iteration format: \begin{equation}u \leftarrow \frac{(W^{\top}W)u}{\Vert (W^{\top}W)u\Vert}\end{equation} After repeating this several times, the norm is finally obtained via: \begin{equation}\Vert W\Vert_2^2 \approx u^{\top}W^{\top}Wu\end{equation} (i.e., obtaining an approximate value for the largest eigenvalue). It can be equivalently rewritten as: \begin{equation}v\leftarrow \frac{W^{\top}u}{\Vert W^{\top}u\Vert},\,u\leftarrow \frac{Wv}{\Vert Wv\Vert},\quad \Vert W\Vert_2 \approx u^{\top}Wv\label{eq:m-norm-iter}\end{equation} In this way, after initializing $u, v$ (using a vector of all ones), you can iterate a few times to get $u, v$, and then plug them into $u^{\top}Wv$ to calculate the approximate value of $\Vert W\Vert_2$.
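Before the proof, here is a minimal numpy sketch of this iteration (my own code; the function name and the choice of r=5 are arbitrary), checked against numpy's exact spectral norm:

import numpy as np

def spectral_norm_np(W, r=5):
    u = np.ones(W.shape[0])      # initialize u with a vector of all ones
    for _ in range(r):
        v = W.T @ u
        v /= np.linalg.norm(v)   # v ← Wᵀu / ||Wᵀu||
        u = W @ v
        u /= np.linalg.norm(u)   # u ← Wv / ||Wv||
    return u @ W @ v             # ||W||_2 ≈ uᵀWv

W = np.random.randn(64, 32)
print(spectral_norm_np(W), np.linalg.norm(W, 2))  # the two values should be close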

Note: For readers interested in the proof, here is a simple proof of why this iteration works.

Let $A=W^{\top}W$ and start from an initial vector $u^{(0)}$. $A$ is real symmetric, hence diagonalizable; assume its eigenvalues $\lambda_1, \dots, \lambda_n$ are such that the largest one is strictly greater than the rest (if not, the largest eigenvalue is a repeated root, which requires a more specialized treatment; from a numerical-computation perspective, almost no two computed numbers are exactly equal, so we can assume repeated roots won't appear in practice). The eigenvectors $\eta_1, \dots, \eta_n$ of $A$ then form a complete basis, so we can write: $$u^{(0)} = c_1 \eta_1 + \dots + c_n \eta_n$$ Each iteration computes $Au/\Vert Au\Vert$, where the denominator only changes the magnitude, so consider the repeated action of $A$: $$A^r u^{(0)} = c_1 A^r \eta_1 + \dots + c_n A^r \eta_n$$ Since for eigenvectors $A\eta = \lambda \eta$, we get: $$A^r u^{(0)} = c_1 \lambda_1^r \eta_1 + \dots + c_n \lambda_n^r \eta_n$$ Without loss of generality, let $\lambda_1$ be the largest eigenvalue. Then: $$\frac{A^r u^{(0)}}{\lambda_1^r} = c_1 \eta_1 + c_2 \left(\frac{\lambda_2}{\lambda_1}\right)^r \eta_2 + \dots + c_n \left(\frac{\lambda_n}{\lambda_1}\right)^r \eta_n$$ Since $\lambda_2/\lambda_1, \dots, \lambda_n/\lambda_1$ are all less than 1, these ratios tend to zero as $r \to \infty$, so for large enough $r$ the trailing terms can be ignored: $$\frac{A^r u^{(0)}}{\lambda_1^r} \approx c_1 \eta_1$$ Up to magnitude, this shows that for sufficiently large $r$, $A^r u^{(0)}$ gives the approximate direction of the eigenvector of the largest eigenvalue (each normalization step merely prevents overflow). Thus $u = A^r u^{(0)}/\Vert A^r u^{(0)}\Vert$ is the corresponding unit eigenvector, i.e.: $$Au = \lambda_1 u$$ Therefore: $$u^{\top}Au = \lambda_1 u^\top u = \lambda_1$$ which is exactly the square of the spectral norm.

Spectral Regularization

Earlier, we showed the relationship between the Frobenius norm and L2 regularization. We also explained that the Frobenius norm is a stronger (coarser) condition, while the spectral norm is more accurate. Although the spectral norm is not as easy to calculate as the Frobenius norm, it can still be approximated in a few steps using equation $\eqref{eq:m-norm-iter}$.

Thus, we can propose the concept of "Spectral Norm Regularization," which uses the square of the spectral norm as an additional regularization term, replacing simple L2 regularization. That is, equation $\eqref{eq:l2-regular}$ becomes: \begin{equation}loss = loss(y, f_w(x)) + \lambda \Vert W\Vert_2^2\end{equation} The paper "Spectral Norm Regularization for Improving the Generalizability of Deep Learning" performed several experiments showing that "Spectral Regularization" improves model performance across multiple tasks.

In Keras, the spectral norm can be calculated with the following code:

import numpy as np
from keras import backend as K

def spectral_norm(w, r=5):
    # flatten w into a 2D matrix of shape (everything else, output dim)
    w_shape = K.int_shape(w)
    in_dim = np.prod(w_shape[:-1]).astype(int)
    out_dim = w_shape[-1]
    w = K.reshape(w, (in_dim, out_dim))
    # power iteration with row vectors: v ← uW/||uW||, u ← vWᵀ/||vWᵀ||
    u = K.ones((1, in_dim))
    for i in range(r):
        v = K.l2_normalize(K.dot(u, w))
        u = K.l2_normalize(K.dot(v, K.transpose(w)))
    # ||W||_2 ≈ uWvᵀ
    return K.sum(K.dot(K.dot(u, w), K.transpose(v)))
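To use this for spectral norm regularization, one can wrap it into a kernel_regularizer (a sketch of mine, with an arbitrary λ=0.01, assuming the spectral_norm function above):

from keras.layers import Dense

def spectral_reg(w, lamb=0.01):
    return lamb * spectral_norm(w) ** 2  # the λ||W||_2^2 term

layer = Dense(128, activation='relu', kernel_regularizer=spectral_reg)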

Generative Models

WGAN

If the L-constraint is just "the icing on the cake" in ordinary supervised training, it is a crucial, indispensable step in the WGAN discriminator. The optimization objective of the WGAN discriminator is: \begin{equation}W(P_r,P_g)=\sup_{\|f\|_L = 1}\mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{x\sim P_g}[f(x)]\end{equation} where $P_r, P_g$ are the real and generated distributions respectively, and $\|f\|_L = 1$ refers to satisfying the specific L-constraint $\|f(x_1) - f(x_2)\| \leq \Vert x_1 - x_2\Vert$ (where $C=1$). The meaning of this objective is to pick the $f$ that maximizes $\mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{x\sim P_g}[f(x)]$ among all functions satisfying this L-constraint—that is the ideal discriminator. Written in loss form, it is: \begin{equation}\min_{\|f\|_L = 1} \mathbb{E}_{x\sim P_g}[f(x)] - \mathbb{E}_{x\sim P_r}[f(x)]\end{equation}
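In code, both objectives are one-liners; a minimal sketch in Keras backend notation (the helper names are mine), where f_real and f_fake are the discriminator outputs on real and generated batches:

from keras import backend as K

def wgan_d_loss(f_real, f_fake):
    # discriminator: minimize E_{x~P_g}[f(x)] - E_{x~P_r}[f(x)]
    return K.mean(f_fake) - K.mean(f_real)

def wgan_g_loss(f_fake):
    # generator: maximize E_{x~P_g}[f(x)], i.e. minimize its negative
    return -K.mean(f_fake)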

Gradient Penalty

Currently, an effective approach is the gradient penalty. Since $\Vert f'(x)\Vert = 1$ is a sufficient condition for $\|f\|_L = 1$, we can add this as a penalty term to the discriminator's loss: \begin{equation}\min_{f} \mathbb{E}_{x\sim P_g}[f(x)] - \mathbb{E}_{x\sim P_r}[f(x)] + \lambda (\Vert f'(x_{inter})\Vert-1)^2\end{equation} Actually, I think adding a $relu(x)=\max(x,0)$ works even better: \begin{equation}\min_{f} \mathbb{E}_{x\sim P_g}[f(x)] - \mathbb{E}_{x\sim P_r}[f(x)] + \lambda \max(\Vert f'(x_{inter})\Vert-1, 0)^2\end{equation} where $x_{inter}$ is sampled by random interpolation: \begin{equation}\begin{aligned}&x_{inter} = \varepsilon x_{real} + (1 - \varepsilon) x_{fake}\\ &\varepsilon\sim U[0,1],\quad x_{real}\sim P_r,\quad x_{fake}\sim P_g \end{aligned}\end{equation} The gradient penalty cannot guarantee $\Vert f'(x)\Vert = 1$, but intuitively it keeps the gradient norm fluctuating around 1, so $\|f\|_L$ should also stay near 1, approximately achieving the L-constraint.
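Here is a sketch of this penalty in Keras backend notation (my own code, assuming a Keras discriminator model d and 4D image tensors; λ=10 is the value commonly used in WGAN-GP):

from keras import backend as K

def gradient_penalty(d, x_real, x_fake, lamb=10):
    eps = K.random_uniform([K.shape(x_real)[0], 1, 1, 1])   # per-sample ε ~ U[0,1]
    x_inter = eps * x_real + (1 - eps) * x_fake             # random interpolation
    grads = K.gradients(K.sum(d(x_inter)), [x_inter])[0]    # f'(x_inter)
    norms = K.sqrt(K.sum(K.square(grads), axis=[1, 2, 3]))  # per-sample ||f'(x_inter)||
    return lamb * K.mean(K.maximum(norms - 1, 0) ** 2)      # the max(·,0)^2 variant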

This scheme works well in many cases, but it performs poorly when the real sample classes are numerous (especially in conditional generation). The problem lies in random interpolation: in principle, the L-constraint should be satisfied throughout the entire space, but gradient penalty through linear interpolation can only guarantee it within a small region. If this small region happens to be the space between real and generated samples, it might be sufficient. However, if there are many classes, interpolating between different classes often lands in unknown territory, resulting in the L-constraint not being met where it matters. Thus, the discriminator fails.

Thinking: Can gradient penalty be used directly as a regularization term for supervised models? Interested readers can give it a try.

Spectral Normalization

The problem with gradient penalty is that it is just a penalty and only works locally. A truly brilliant scheme is the constructive method: build a special $f$ such that regardless of its parameters, $f$ always satisfies the L-constraint.

In fact, when WGAN was first proposed, it used weight clipping—clipping all parameters to not exceed a certain constant. In this way, the Frobenius norm of the parameters won't exceed a certain value, so $\|f\|_L$ won't exceed a certain constant. Although this doesn't accurately implement $\|f\|_L=1$, it only scales the loss by a constant factor, which doesn't affect optimization results. Weight clipping is a constructive method, but it is not very optimization-friendly.
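For reference, weight clipping itself is a one-liner when expressed as a Keras constraint (a sketch; the value c=0.01 follows the original WGAN paper):

from keras import backend as K

def weight_clipping(w, c=0.01):
    return K.clip(w, -c, c)  # clamp every entry of w into [-c, c]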

Looking at it plainly, the clipping scheme leaves plenty of room for improvement. For example, we could instead clip the Frobenius norm of each weight matrix to not exceed a constant, which is more flexible than clipping individual weights. If clipping feels too crude, a parameter penalty is also an option: impose a heavy penalty whenever the Frobenius norm exceeds a given threshold. I've tried this, and it generally works, but convergence is slow.

However, these schemes are just approximations. Now that we have the spectral norm, we can use the most accurate scheme: replace all parameters $w$ in $f$ with $w/\Vert w\Vert_2$. This is Spectral Normalization, proposed and experimented with in "Spectral Normalization for Generative Adversarial Networks". Consequently, if the absolute value of the derivative of the activation functions used in $f$ does not exceed 1, then we have $\|f\|_L\leq 1$, achieving the required L-constraint with the most precise method.

Note: "the derivative of the activation functions does not exceed 1" is usually satisfied. However, if the discriminator uses residual structures, a residual block behaves like $x + relu(Wx+b)$, whose derivative with respect to $x$ may exceed 1. But in any case it remains bounded by a constant, so optimization is not affected.

I have tried using spectral normalization in WGAN myself (without gradient penalty; see below for reference code) and found that the final convergence speed (epochs required for the same effect) is faster than WGAN-GP, and the effect is even better. Moreover, there is another reason affecting speed: the execution time per epoch using gradient penalty is longer than using spectral normalization. This is because with gradient penalty, the second-order gradient must be calculated during descent, requiring the entire forward pass to be executed twice, which is slow.

Keras Implementation

In Keras, implementing spectral normalization is both easy and not easy.

By easy, I mean you only need to pass a kernel_constraint parameter to every convolutional and fully connected layer of the discriminator, and gamma_constraint to BN layers. The constraint can be written as:

def spectral_normalization(w):
    return w / spectral_norm(w)
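For illustration, wiring the constraint into the discriminator's layers looks like this (a sketch of mine, assuming the spectral_norm and spectral_normalization functions above):

from keras.layers import Conv2D, Dense, BatchNormalization

conv = Conv2D(64, 3, padding='same',
              kernel_constraint=spectral_normalization)
dense = Dense(1, kernel_constraint=spectral_normalization)
bn = BatchNormalization(gamma_constraint=spectral_normalization)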

Reference Code: https://github.com/bojone/gan/blob/master/keras/wgan_sn_celeba.py

By not easy, I mean that in the current Keras (version 2.2.4), kernel_constraint does not actually change the kernel; it only adjusts the kernel value after the gradient descent step. This is different from the spectral_normalization approach in the paper. If used this way, you will find that the gradients become inaccurate later on, and the generation quality is poor. To truly modify the kernel, we either have to redefine all layers (Convolution, Dense, BN, etc.) that include matrix multiplication, or modify the source code. Modifying the source is the easiest solution. Modify the add_weight method of the Layer object in keras/engine/base_layer.py. Originally (starting from line 222):

 def add_weight(self,
                name,
                shape,
                dtype=None,
                initializer=None,
                regularizer=None,
                trainable=True,
                constraint=None):
     """Adds a weight variable to the layer.
     # Arguments
         name: String, the name for the weight variable.
         shape: The shape tuple of the weight.
         dtype: The dtype of the weight.
         initializer: An Initializer instance (callable).
         regularizer: An optional Regularizer instance.
         trainable: A boolean, whether the weight should
             be trained via backprop or not (assuming
             that the layer itself is also trainable).
         constraint: An optional Constraint instance.
     # Returns
         The created weight variable.
     """
     initializer = initializers.get(initializer)
     if dtype is None:
         dtype = K.floatx()
     weight = K.variable(initializer(shape),
                        dtype=dtype,
                        name=name,
                        constraint=constraint)
     if regularizer is not None:
         with K.name_scope('weight_regularizer'):
             self.add_loss(regularizer(weight))
     if trainable:
         self._trainable_weights.append(weight)
     else:
         self._non_trainable_weights.append(weight)
     return weight

Change it to:

 def add_weight(self,
                name,
                shape,
                dtype=None,
                initializer=None,
                regularizer=None,
                trainable=True,
                constraint=None):
     """Adds a weight variable to the layer.
     # Arguments
         name: String, the name for the weight variable.
         shape: The shape tuple of the weight.
         dtype: The dtype of the weight.
         initializer: An Initializer instance (callable).
         regularizer: An optional Regularizer instance.
         trainable: A boolean, whether the weight should
             be trained via backprop or not (assuming
             that the layer itself is also trainable).
         constraint: An optional Constraint instance.
     # Returns
         The created weight variable.
     """
     initializer = initializers.get(initializer)
     if dtype is None:
         dtype = K.floatx()
     weight = K.variable(initializer(shape),
                        dtype=dtype,
                        name=name,
                        constraint=None)
     if regularizer is not None:
         with K.name_scope('weight_regularizer'):
             self.add_loss(regularizer(weight))
     if trainable:
         self._trainable_weights.append(weight)
     else:
         self._non_trainable_weights.append(weight)
     if constraint is not None:
         return constraint(weight)
     return weight

That is, change constraint in K.variable to None and execute constraint at the end. Note: don't complain that Keras is too encapsulated or inflexible just because you have to change the source. If you use other frameworks, the modification required would likely be many times more complex (relative to the amount of change for a GAN without spectral normalization).

(Update: a new implementation that doesn't require modifying the source code is here.)

Summary

This article is a summary of the Lipschitz constraint, mainly introducing how to make models satisfy it, which is closely related to a model's generalization ability. The harder part is the spectral norm, which involves a fair amount of theory and formulas.

Overall, the content related to the spectral norm is quite refined, and the relevant conclusions further show that linear algebra is closely linked to machine learning. Many "advanced" linear algebra concepts can find their corresponding applications in machine learning.