By 苏剑林 | November 30, 2022
Drawing on theoretical physics to advance machine learning is no longer a novelty. For example, the article introduced last month, "General Discourse on Generative Diffusion Models (13): From Universal Gravitation to Diffusion Models," is a classic case. Recently, a new paper titled "Self-Supervised Learning based on Heat Equation" piqued my interest. As the name suggests, it uses the heat conduction equation to do self-supervised learning in computer vision. How do such physical equations play a role in machine learning? Can the same idea be transferred to NLP? Let's read the paper together.
As shown in the figure below, on the left is the solution to the heat conduction equation in physics, and on the right is the attribution heatmap obtained through saliency methods such as CAM and Integrated Gradients. It can be seen that there is a certain similarity between the two. Consequently, the authors believe that the heat conduction equation can serve as a good prior for visual features.
Heatmap of the heat equation (left) and heatmap of a vision model (right)
Specifically, the physical heat conduction equation is:
\begin{equation}\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\end{equation}where $x, y$ correspond to the "width" and "height" dimensions of the image, and $u$ corresponds to the feature value at that location. Since this paper deals with static images rather than videos, there is no time dimension $t$, so we can simply set $\frac{\partial u}{\partial t}=0$. And since features are usually multi-dimensional vectors rather than scalars, we replace $u$ with $\boldsymbol{z}$, obtaining:
\begin{equation}\frac{\partial^2 \boldsymbol{z}}{\partial x^2} + \frac{\partial^2 \boldsymbol{z}}{\partial y^2} = 0\label{eq:laplace}\end{equation}This is known as the "Laplace Equation." It is isotropic, but images are not always isotropic. Therefore, we can supplement it with an $\boldsymbol{S}$ matrix to capture this anisotropy:
\begin{equation}\frac{\partial^2 \boldsymbol{z}}{\partial x^2} + \boldsymbol{S}\frac{\partial^2 \boldsymbol{z}}{\partial y^2} = 0\label{eq:laplace-s}\end{equation}However, this is a second-order equation, and as we will see later, it would be troublesome to discretize. Thus, the authors propose further transforming it into a system of first-order equations:
\begin{equation}\frac{\partial \boldsymbol{z}}{\partial x} = \boldsymbol{A}\boldsymbol{z},\quad \frac{\partial \boldsymbol{z}}{\partial y} = \boldsymbol{B}\boldsymbol{z}\label{eq:laplace-1o}\end{equation}It can be verified that as long as $\boldsymbol{S} = -\boldsymbol{A}^2(\boldsymbol{B}^2)^{-1}$, the solution to the above equation must also be a solution to equation $\eqref{eq:laplace-s}$. Therefore, the original paper takes equation $\eqref{eq:laplace-1o}$ as its starting point.
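As a quick check (a one-line verification added here for completeness): differentiating equation $\eqref{eq:laplace-1o}$ once more gives
\begin{equation}\frac{\partial^2 \boldsymbol{z}}{\partial x^2} = \boldsymbol{A}\frac{\partial \boldsymbol{z}}{\partial x} = \boldsymbol{A}^2\boldsymbol{z},\qquad \frac{\partial^2 \boldsymbol{z}}{\partial y^2} = \boldsymbol{B}^2\boldsymbol{z}\quad\Rightarrow\quad \frac{\partial^2 \boldsymbol{z}}{\partial x^2} + \boldsymbol{S}\frac{\partial^2 \boldsymbol{z}}{\partial y^2} = \left(\boldsymbol{A}^2 + \boldsymbol{S}\boldsymbol{B}^2\right)\boldsymbol{z}\end{equation}which vanishes identically exactly when $\boldsymbol{S} = -\boldsymbol{A}^2(\boldsymbol{B}^2)^{-1}$ (assuming $\boldsymbol{B}^2$ is invertible).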
Despite all the setup, the core idea of the original paper is quite simple: it assumes that the features obtained by passing the original image through an encoder should satisfy equation $\eqref{eq:laplace-1o}$ as much as possible. Specifically, after the image passes through the encoder and before global pooling, we obtain a feature map of size $w \times h \times d$. We view this as $w \times h$ $d$-dimensional vectors, or equivalently a function $\boldsymbol{z}(x, y) \in \mathbb{R}^d$, where $(x, y)$ is the position of the vector. The function $\boldsymbol{z}(x, y)$ should then satisfy equation $\eqref{eq:laplace-1o}$ as much as possible.
How do we encourage this? From equation $\eqref{eq:laplace-1o}$, we can derive the discretization:
\begin{equation}\begin{aligned} &\,\boldsymbol{z}(x+\Delta x, y) \approx \boldsymbol{z}(x, y) + \Delta x \boldsymbol{A}\boldsymbol{z}(x,y) = (\boldsymbol{I} + \Delta x \boldsymbol{A})\boldsymbol{z}(x,y) \\ &\,\boldsymbol{z}(x, y+\Delta y) \approx \boldsymbol{z}(x, y) + \Delta y \boldsymbol{B}\boldsymbol{z}(x,y) = (\boldsymbol{I} + \Delta y \boldsymbol{B})\boldsymbol{z}(x,y) \end{aligned}\label{eq:laplace-delta}\end{equation}This means we can predict the features of adjacent positions from the features of the current position. Accordingly, the original paper proposes a self-supervised learning method named "QB-Heat": only a small portion of the image is fed in at a time, its features are obtained via the encoder, the features of the complete image are predicted via the discretized equation $\eqref{eq:laplace-delta}$, and these features are then passed to a small decoder to reconstruct the complete image.
The schematic diagram is as follows:
Schematic of the QB-Heat framework
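To make the prediction step more concrete, here is a minimal NumPy sketch of propagating features outward with the discretized equation $\eqref{eq:laplace-delta}$. It is only an illustration under simplifying assumptions of my own (a single known centre vector, random $\boldsymbol{A}$ and $\boldsymbol{B}$, unit step sizes); the paper's actual prediction module and parameterization may differ.

```python
import numpy as np

d = 8          # feature dimension (placeholder)
w, h = 6, 6    # feature-map size (placeholder)
dx = dy = 1.0  # grid spacing

# In QB-Heat A and B would be learned; small random matrices here,
# purely to illustrate the propagation rule.
A = 0.01 * np.random.randn(d, d)
B = 0.01 * np.random.randn(d, d)

# Suppose we only know the feature vector at the centre position.
z_center = np.random.randn(d)
cx, cy = w // 2, h // 2

# One-step operators from the discretization above:
#   z(x+Δx, y) ≈ (I + Δx·A) z(x, y),   z(x, y+Δy) ≈ (I + Δy·B) z(x, y)
Px = np.eye(d) + dx * A
Py = np.eye(d) + dy * B

# Fill in the whole w×h×d feature map by repeated application; a negative
# matrix power steps in the opposite direction via the inverse operator.
Z = np.zeros((w, h, d))
for x in range(w):
    for y in range(h):
        Z[x, y] = (np.linalg.matrix_power(Py, y - cy)
                   @ np.linalg.matrix_power(Px, x - cx)
                   @ z_center)

print(Z.shape)  # (6, 6, 8)
```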
That concludes the introduction to QB-Heat. The remainder of the original paper consists of experimental results and some (in my opinion) largely irrelevant analysis, which I will skip here. Interested readers can refer directly to the original paper.
Readers familiar with the MAE model (see "MLM and MAE from the Perspective of Dropout: Some New Insights") will notice that QB-Heat bears many similarities to MAE: both feed a partial image into an encoder and then reconstruct the full image, and both use a large encoder with a small decoder. Apart from the masking scheme, the biggest difference lies in the input to the decoder: QB-Heat predicts the features of the remaining parts of the image via the approximation $\eqref{eq:laplace-delta}$, whereas MAE simply fills the remaining positions with the same [MASK] feature. It is plausible that predicting via $\eqref{eq:laplace-delta}$ is more principled than crudely filling in [MASK], so it stands to reason that QB-Heat performs better than MAE.
Schematic of the MAE model
Equation $\eqref{eq:laplace-delta}$ dictates that QB-Heat can only predict outward from the center (otherwise, interpolating in the middle becomes more troublesome). Therefore, QB-Heat's masking scheme is limited to retaining a continuous square region and masking everything around it, as shown in the figure below. Precisely because QB-Heat's input is a continuous sub-image of the original image, its encoder can be built with either a Transformer or a pure CNN. In contrast, MAE randomly masks some of the original image's patches, so to actually save encoder computation, MAE's encoder can only be a Transformer, because only a Transformer can shorten the input sequence while retaining positional information.
Schematic of the Masking method in QB-Heat
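For illustration, here is a small sketch of the keep-the-centre masking described above. This is my own reading of the scheme; the block size, placement, and keep ratio used in the paper are not reproduced here.

```python
import numpy as np

def center_keep_mask(h, w, keep_ratio=0.25):
    """Boolean patch mask: True = visible. Keeps a single centred square
    block and masks everything around it, in the spirit of QB-Heat's
    masking; MAE, by contrast, scatters the visible patches randomly."""
    side = int(round((keep_ratio * h * w) ** 0.5))
    top, left = (h - side) // 2, (w - side) // 2
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + side, left:left + side] = True
    return mask

mask = center_keep_mask(14, 14, keep_ratio=0.25)  # 14×14 patch grid
print(mask.sum() / mask.size)  # fraction of patches kept, here 0.25
```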
The physics perspective looks beautiful, but often it serves as a "guise" (in a non-derogatory sense). It is more important for us to see through the phenomenon to the essence and consider the actual mechanism that makes it work.
First, an obvious "point of critique" for QB-Heat is that although both the title and the method are crowned with the name of the heat conduction equation, the heat equation itself is on stage for "no more than three seconds" and feels rather dispensable. In fact, the paper's real starting point is equation $\eqref{eq:laplace}$, the Laplace equation. Although the Laplace equation is mathematically just the steady-state case of the heat conduction equation, the two are classified and studied as different branches in both mathematics and physics, so the name "heat conduction equation" feels a bit forced. Second, what is actually used is not the original equation $\eqref{eq:laplace}$ or $\eqref{eq:laplace-s}$ but the first-order system $\eqref{eq:laplace-1o}$, and in application it is the approximation $\eqref{eq:laplace-delta}$ that appears. Setting the physical background aside and looking directly at equation $\eqref{eq:laplace-delta}$, it expresses the following hypothesis: adjacent feature vectors should be as close as possible, and their difference should, as far as possible, be given by one fixed linear transformation of the current feature vector.
Simply put, it applies explicit prediction to feature vectors through assumptions of continuity and linearity, thereby playing an implicit role of regularization. This reminds me of mixup, introduced in "From SamplePairing to mixup: Miraculous Regularization Terms," which also adds implicit linear regularization to the model by explicitly constructing data, thereby enhancing the final generalization ability of the model.
For me, whenever I see a method in CV, I usually wonder whether it can be transferred to NLP. Can QB-Heat make this migration? Compared to MAE, the biggest change QB-Heat makes is that the features of the remaining part of the image are predicted under certain hypotheses rather than uniformly replaced by [MASK]. In CV, QB-Heat uses continuity and linearity assumptions; can the same be done in NLP? Language is essentially a time series with only one dimension of variation, so the question becomes: can we assume that the sentence vectors of adjacent sentences differ by the same linear transformation? Natural language does not seem to possess such good continuity, but understood purely as a form of linear regularization, it does not seem infeasible, especially since mixup works well on many NLP tasks.
Additionally, if we randomly mask a subset of tokens instead of retaining a single continuous sub-interval as QB-Heat does, it seems we could directly predict the features of the masked middle positions by linearly interpolating the feature vectors on both sides, which would likewise satisfy the continuity and linearity assumptions (a rough sketch follows below). Whether this treatment would yield better results, I do not know; these are fairly preliminary thoughts that await experimental verification.
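As a rough sketch of what that interpolation could look like (entirely my own illustration, not something from the paper; the function name and setup are made up for this example), assume the encoder has produced features at the visible token positions and we fill in the masked ones before handing everything to the decoder:

```python
import numpy as np

def interpolate_masked_features(z, visible):
    """z: (seq_len, d) feature matrix; rows at masked positions are ignored.
    visible: boolean array of shape (seq_len,), True = position was encoded.
    Fills each masked position by linearly interpolating between the nearest
    visible features on its left and right (copying the nearest one at the edges)."""
    idx = np.where(visible)[0]
    out = z.copy()
    for pos in np.where(~visible)[0]:
        left = idx[idx < pos]
        right = idx[idx > pos]
        if len(left) == 0:            # nothing visible to the left
            out[pos] = z[right[0]]
        elif len(right) == 0:         # nothing visible to the right
            out[pos] = z[left[-1]]
        else:
            l, r = left[-1], right[0]
            w = (pos - l) / (r - l)   # interpolation weight
            out[pos] = (1 - w) * z[l] + w * z[r]
    return out

# Toy usage: 10 tokens, 4 of them masked.
z = np.random.randn(10, 16)
visible = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=bool)
z_full = interpolate_masked_features(z, visible)
```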
This article introduced QB-Heat, a scheme that uses the heat conduction equation to guide self-supervised learning. It differs from MAE in that the features of the remaining parts of the image passed to the decoder are obtained by a simple prediction rather than filled in with [MASK].