By 苏剑林 | Dec 21, 2020
As is well known, before the era of deep learning, machine learning was dominated by the SVM (Support Vector Machine). It was once the center of attention in the machine learning world, captivating countless researchers. Even today, "hand-deriving SVM" remains a popular interview question at major tech companies. However, times have changed: when deep learning took off, the first thing it "revolutionized" was the SVM. Today, SVMs appear mostly in niche scenarios where efficiency is paramount, or in the aforementioned interview questions.
In a surprising turn of events, a recent paper on arXiv, "Every Model Learned by Gradient Descent Is Approximately a Kernel Machine", made a very "bold" declaration:
Any model learned by a gradient descent algorithm can be approximately viewed as an SVM!
This conclusion is truly "bold" because it isn't just aimed at deep learning; it suggests that as long as you optimize using gradient descent, the resulting model is nothing more than an (approximation of an) SVM. I have reviewed the analysis in the original paper and found it to be quite interesting and reasonable. It helps deepen our understanding of many models, so I would like to share it with everyone.
SVM Basics
A general SVM can be expressed in the following form:
\begin{equation}
y=g\left(\beta+\sum_i \alpha_i K(x,x_i)\right) \label{eq:svm}
\end{equation}
where $\{(x_i, y_i)\}$ are the training data pairs, and $\alpha_i, \beta$ are learnable parameters. The output of a standard kernel machine is a scalar, so here we take $y, \alpha_i, \beta$ to be scalars. $K(x, x_i)$ is called the "kernel function"; it measures some notion of similarity between the input $x$ and the training sample $x_i$. The SVM is an instance (probably the most famous one) of the broader class of "kernel machine" models and belongs to the family of "kernel methods."
Intuitively, an SVM is actually a retrieval model. It retrieves the similarity $K(x, x_i)$ between the input and all training samples and then computes a weighted sum. Therefore, strictly speaking, the parameter set of an SVM includes not only the various $\alpha_i$ and $\beta$ but also the training inputs $x_i$. Simply put, it memorizes the entire training set. In contrast, deep learning models also have many parameters, but these parameters are directly solved by gradient descent rather than storing the training set directly. Because of this feature, deep learning models are generally thought to be able to learn more intelligent features automatically.
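To make the form in \eqref{eq:svm} concrete, here is a minimal sketch of kernel-machine prediction in JAX. The RBF kernel, the output nonlinearity, and all names here are illustrative assumptions, not anything taken from the paper:

```python
# A minimal kernel-machine predictor in the form of \eqref{eq:svm}:
#   y = g(beta + sum_i alpha_i * K(x, x_i))
# Illustrative sketch only; the RBF kernel and g = tanh are assumptions.
import jax.numpy as jnp

def rbf_kernel(x, xi, gamma=1.0):
    # one common choice of similarity between x and a stored training input xi
    return jnp.exp(-gamma * jnp.sum((x - xi) ** 2))

def kernel_machine(x, X_train, alphas, beta, g=jnp.tanh):
    # weighted sum of similarities to every stored training input
    ks = jnp.array([rbf_kernel(x, xi) for xi in X_train])
    return g(beta + jnp.dot(alphas, ks))
```

Note that `X_train` is stored inside the model: the prediction literally loops over every training input, which is the "retrieval" behavior described above.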
Analytical Derivation
The theory of SVM is not the focus of this article; it is sufficient for us to know its form as shown in \eqref{eq:svm}. In this section, we will derive an analytical solution for gradient descent and discover that this solution has a very similar form to equation \eqref{eq:svm}, which is why we say that models produced by gradient descent can be approximately viewed as SVM models.
Suppose our model is $y=f_\theta(x)$, where $\theta$ are the trainable parameters. The loss function for a single sample is $l(y_i, f_\theta(x_i))$. Then the loss function used for training is:
\begin{equation}
L(\theta)=\sum_i l(y_i, f_\theta(x_i))
\end{equation}
To make the subsequent derivation more concise, we use the summation form here; an average is more common, but this does not affect the final result. In the "Optimization Algorithms from a Dynamical Perspective" series of articles, we have maintained the view that solving for the parameters $\theta$ by gradient descent is, in the limit of a small learning rate, equivalent to solving the dynamical system:
\begin{equation}
\frac{d\theta}{dt}=-\frac{\partial L(\theta)}{\partial \theta}=-\sum_i \frac{\partial l(y_i, f_\theta(x_i))}{\partial \theta}=-\sum_i \frac{\partial l(y_i, f_\theta(x_i))}{\partial f_\theta(x_i)} \frac{\partial f_\theta(x_i)}{\partial \theta}
\end{equation}
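To connect the discrete algorithm with this continuous flow explicitly: gradient descent with learning rate $\varepsilon$ is exactly the Euler discretization of the system above,
\begin{equation}
\theta_{k+1} = \theta_k - \varepsilon \left.\frac{\partial L(\theta)}{\partial \theta}\right|_{\theta=\theta_k}
\end{equation}
so the continuous-time analysis that follows describes gradient descent in the limit $\varepsilon \to 0$.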
This is also the key starting point of this article. Now, let's consider the change in $f_\theta(x)$:
\begin{equation}
\frac{df_\theta(x)}{dt}=\sum_j \frac{\partial f_\theta(x)}{\partial \theta_j} \frac{d\theta_j}{dt}=-\sum_j \frac{\partial f_\theta(x)}{\partial \theta_j} \sum_i \frac{\partial l(y_i, f_\theta(x_i))}{\partial f_\theta(x_i)} \frac{\partial f_\theta(x_i)}{\partial \theta_j}=-\sum_i \frac{\partial l(y_i, f_\theta(x_i))}{\partial f_\theta(x_i)} \sum_j \frac{\partial f_\theta(x)}{\partial \theta_j} \frac{\partial f_\theta(x_i)}{\partial \theta_j}
\end{equation}
We can see that the summation over $j$ is just the inner product of gradients $\langle \nabla_\theta f_\theta(x), \nabla_\theta f_\theta(x_i) \rangle$. In neural networks this quantity has a rather cool name, the "Neural Tangent Kernel" (NTK), which we denote as:
\begin{equation}
K_\theta(x, x_i) = \langle \nabla_\theta f_\theta(x), \nabla_\theta f_\theta(x_i) \rangle = \sum_j \frac{\partial f_\theta(x)}{\partial \theta_j} \frac{\partial f_\theta(x_i)}{\partial \theta_j}
\end{equation}
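The NTK can be computed directly from this definition. Below is a minimal sketch in JAX for a toy two-layer network; the architecture and all names are purely illustrative:

```python
# Neural tangent kernel K_theta(x, x') = <grad_theta f(x), grad_theta f(x')>
# for a toy scalar-output MLP; a sketch, not any library's official NTK API.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def f(theta, x):
    # two-layer network: tanh hidden layer, linear scalar readout
    W1, b1, w2, b2 = theta
    return w2 @ jnp.tanh(W1 @ x + b1) + b2

def ntk(theta, x1, x2):
    # flatten the per-parameter gradients, then take their inner product
    g1, _ = ravel_pytree(jax.grad(f)(theta, x1))
    g2, _ = ravel_pytree(jax.grad(f)(theta, x2))
    return jnp.dot(g1, g2)

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
theta = (jax.random.normal(k1, (8, 3)), jnp.zeros(8),
         jax.random.normal(k2, (8,)), jnp.array(0.0))
x1, x2 = jax.random.normal(k3, (2, 3))
print(ntk(theta, x1, x2))
```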
And by denoting $\alpha_{\theta,i} = -\frac{\partial l(y_i, f_\theta(x_i))}{\partial f_\theta(x_i)}$ (for example, for the squared loss $l = \frac{1}{2}(f_\theta(x_i) - y_i)^2$ this is simply the residual $y_i - f_\theta(x_i)$), we have:
\begin{equation}
\frac{df_\theta(x)}{dt} = \sum_i \alpha_{\theta,i} K_\theta(x, x_i)
\end{equation}
It can be seen that at each instant, the rate of change of the model output $f_\theta(x)$ has exactly the form of an SVM. If we know the trajectory $\theta(t), t \in [0, T]$ traced out during optimization, then the final model is:
\begin{equation}
f_{\theta_T}(x) = f_{\theta_0}(x) + \sum_i \int_0^T \alpha_{\theta(t),i} K_{\theta(t)}(x, x_i) dt \label{eq:sgdf}
\end{equation}
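Equation \eqref{eq:sgdf} can be checked numerically: run gradient descent with a small learning rate and accumulate the kernel terms step by step, and the accumulated sum should approximately match how far the model output actually moved. A rough sketch, reusing `f`, `ntk`, and `theta` from the snippet above, with a squared loss and all data purely synthetic:

```python
# Numerical check of the path-integral identity: for small lr,
#   f_T(x) - f_0(x)  ~  sum_t lr * sum_i alpha_i(t) * K_theta(t)(x, x_i).
# Reuses f, ntk, theta from the previous sketch; squared loss assumed.
X = jax.random.normal(jax.random.PRNGKey(1), (16, 3))  # toy training inputs
Y = jnp.sin(X.sum(axis=1))                             # toy targets
x_probe = jnp.ones(3)                                  # held-out query point
lr, steps = 1e-3, 2000

def loss(theta):
    preds = jax.vmap(lambda xi: f(theta, xi))(X)
    return 0.5 * jnp.sum((preds - Y) ** 2)

f0, kernel_sum = f(theta, x_probe), 0.0
for _ in range(steps):
    preds = jax.vmap(lambda xi: f(theta, xi))(X)
    alphas = Y - preds                        # alpha_i = -dl/df = y_i - f(x_i)
    ks = jnp.array([ntk(theta, x_probe, xi) for xi in X])
    kernel_sum += lr * jnp.sum(alphas * ks)   # Euler slice of the integral
    grads = jax.grad(loss)(theta)
    theta = jax.tree_util.tree_map(lambda p, g: p - lr * g, theta, grads)

print(f(theta, x_probe) - f0, kernel_sum)     # should nearly agree
```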
Results Analysis
Having derived equation \eqref{eq:sgdf}, we now have the theoretical solution of gradient descent in the limit where the learning rate approaches 0. The derivation shows that this result depends only on gradient descent itself and is independent of the specific model structure. We can understand equation \eqref{eq:sgdf} from the following perspective.
First, we let $\beta(x) = f_{\theta_0}(x)$, which is the initial model. Although it theoretically depends on $x$, in many cases it behaves close to a constant (for example, in multi-class models, the output of the initialized model is usually close to a uniform distribution); thus, we can treat it as a constant term. Then, we can denote:
\begin{equation}
\alpha_i(x) = \frac{\int_0^T \alpha_{\theta(t),i} K_{\theta(t)}(x, x_i) dt}{\int_0^T K_{\theta(t)}(x, x_i) dt}, \quad K(x, x_i) = \int_0^T K_{\theta(t)}(x, x_i) dt
\end{equation}
Then:
\begin{equation}
f_{\theta_T}(x) = \beta(x) + \sum_i \alpha_i(x) K(x, x_i)
\end{equation}
This is very similar in form to an SVM. The difference is that in an SVM, $\alpha_i$ and $\beta$ must be independent of $x$, whereas here they depend on $x$. We have already discussed $\beta(x)$. As for $\alpha_i(x)$, it is a weighted average in which the averaged quantity $\alpha_{\theta(t),i}$ does not depend on $x$ and only the weights $K_{\theta(t)}(x, x_i)$ do, so its dependence on $x$ may be relatively weak; like $\beta(x)$, we might therefore approximately ignore its dependence on $x$. However, in my view, whether it depends on $x$ is not the key point. What matters most is that the final result takes the form $\sum_i \alpha_i(x) K(x, x_i)$, which means that to some extent the model has learned a process of retrieving the training set. This is its true similarity to an SVM.
The above discussion focuses on models with scalar outputs. If the output is a $d$-dimensional vector, the final form is identical, except that $\beta(x), \alpha_i(x)$ are also $d$-dimensional vectors, and $K(x, x_i)$ is a $d \times d$ matrix. In this case, even if $\beta(x)$ and $\alpha_i(x)$ are independent of $x$, it is still not the (multi-class) SVM model in the conventional sense. However, it still possesses the form $\sum_i \alpha_i(x) K(x, x_i)$, so in some sense, it remains an operation of retrieving the training set.
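For the vector-output case the kernel can be computed the same way from Jacobians. A minimal sketch, again in JAX and continuing the setup above, with an illustrative vector-output network `f_vec` (the shapes and names are assumptions):

```python
# For a d-dimensional output f_vec, the kernel is the d x d matrix
#   K(x, x') = J(x) J(x')^T, where J is the Jacobian w.r.t. theta.
def f_vec(theta, x):
    W1, b1, W2, b2 = theta                   # W2, b2 now matrix/vector valued
    return W2 @ jnp.tanh(W1 @ x + b1) + b2   # output in R^d

def ntk_matrix(theta, x1, x2):
    def flat_jac(x):
        # stack flattened per-parameter Jacobians into a (d, P) matrix
        jac = jax.jacobian(f_vec)(theta, x)
        leaves = jax.tree_util.tree_leaves(jac)
        return jnp.concatenate([l.reshape(l.shape[0], -1) for l in leaves],
                               axis=1)
    return flat_jac(x1) @ flat_jac(x2).T     # (d, d) kernel matrix

d = 4
kv = jax.random.split(jax.random.PRNGKey(2), 2)
theta_vec = (jax.random.normal(kv[0], (8, 3)), jnp.zeros(8),
             jax.random.normal(kv[1], (d, 8)), jnp.zeros(d))
print(ntk_matrix(theta_vec, x1, x2).shape)   # (4, 4)
```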
Furthermore, the above conclusion is for (full-batch) gradient descent. For Stochastic Gradient Descent (SGD), we no longer use the full dataset to calculate the loss function. We discussed this in the first article "Optimization Algorithms from a Dynamical Perspective (I): From SGD to Momentum Acceleration". We can consider that SGD introduces noise on top of gradient descent; that is, the convergence path $\theta(t)$ carries random noise, while the rest of the results remain basically unchanged. Therefore, the above conclusion also holds for SGD.
Extension Thinking
So, what conceptual impact might this result have? The original paper devotes considerable space to this in its "Discussion" section; let's reflect on it as well.
From the perspective of deep learning, this result reveals the connection between deep neural network models and traditional kernel methods, using the interpretability of kernel methods to enhance the interpretability of neural networks. For example, by using the inner product of gradients as a measure of similarity, we might be able to retrieve training samples from the training set that are similar to the input to explain the decision-making process of the output. Furthermore, if this direction can be more precisely quantified, it could greatly improve methods for incremental learning—that is, for new labeled samples, we might only need to find a way to add terms like $\alpha_i(x) K(x, x_i)$ to the model without retraining it from scratch.
Conversely, this result might promote the development of kernel machines and kernel methods. Traditional kernel functions rely on manual definitions, whereas the gradient inner-product kernel above suggests new ways of constructing kernel functions, enhancing the capacity of kernel methods to model complex functions. At the same time, given the similarity between gradient descent and kernel machines, we might eventually train kernel machines by gradient descent, thereby overcoming the difficulty of training kernel machines on large-scale data, and so on.
There are some other "brainstorms" to explore. For instance, we know that a strictly convex optimization problem has a unique solution, and in theory gradient descent can always find it. Since the argument above says that gradient descent is (approximately) equivalent to an SVM, does this imply that the solution of every such convex optimization problem is equivalent to an SVM? Is that a big enough brainstorm?
In summary, revealing the connection between gradient descent and kernel machines helps in mutual learning and integration between the two, and potentially opens up new research avenues.