By 苏剑林 | December 11, 2020
Since SimCLR, work on unsupervised feature learning in CV (Computer Vision) has appeared in an endless stream, making it hard to keep up. Most of these works are based on contrastive learning, which performs a form of classification learning by constructing positive and negative sample pairs in appropriate ways. However, among so many similar works, there are always a few unique ones, such as Google's BYOL and the more recent SimSiam. They propose schemes that can complete feature learning relying only on positive samples, which feels refreshing. But without the support of negative samples, why doesn't the model collapse to a meaningless constant model? This is the most thought-provoking question in these two papers.
SimSiam provides an answer that many have praised, but I feel that SimSiam only rephrases the question without truly answering it. I believe the success of models like SimSiam and GANs is largely due to the use of gradient-based optimizers (rather than other, stronger or weaker, optimization methods), so any answer that does not take the optimization dynamics into account is incomplete. Here, I try to analyze why SimSiam does not collapse from the perspective of its dynamics.
SimSiam
Before looking at SimSiam, we can first look at BYOL from the paper "Bootstrap your own latent: A new approach to self-supervised Learning". Its learning process is simple: maintain two encoders, Student and Teacher, where the Teacher is an exponential moving average (EMA) of the Student, and the Student in turn learns from the Teacher. It feels like "stepping on your left foot with your right foot to fly up." The schematic is as follows:

SimSiam, from the paper "Exploring Simple Siamese Representation Learning", is even simpler. It directly removes the moving average of BYOL:

In fact, SimSiam is equivalent to setting the moving average parameter $\tau$ of BYOL to 0, which shows that the moving average is not strictly necessary. To find the key parts of the algorithm, SimSiam also conducted many comparative experiments, confirming that the stop_gradient operator and the predictor module $h_{\varphi}(z)$ are key to SimSiam's non-collapse. To explain this phenomenon, SimSiam proposed that the optimization process is actually equivalent to alternating optimization of:
\begin{equation}\mathcal{L}(\theta, \eta)=\mathbb{E}_{x, \mathcal{T}}\left[\left\|\mathcal{F}_{\theta}(\mathcal{T}(x))-\eta_{x}\right\|^2\right]\label{eq:simsiam}\end{equation}
where $x$ represents the training samples and $\mathcal{T}$ represents data augmentation. There are already many interpretations of this part online, and reading the original paper is not difficult, so I won't expand on it in detail.
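For concreteness, here is a minimal PyTorch-style sketch of one SimSiam loss computation. It follows the spirit of the paper's published pseudocode (negative cosine similarity, symmetrized loss, stop_gradient implemented via `detach()`), but the function and variable names are my own, and the encoder `f` and predictor `h` are assumed to be defined elsewhere.

```python
import torch.nn.functional as F

def simsiam_loss(f, h, x1, x2):
    """SimSiam loss for two augmented views x1, x2 of the same batch of images.

    f: encoder (backbone + projection MLP), h: predictor MLP.
    The stop_gradient is implemented with .detach(): the target branch
    contributes no gradient to the encoder parameters.
    """
    z1, z2 = f(x1), f(x2)      # features of the two views
    p1, p2 = h(z1), h(z2)      # predictor outputs

    def d(p, z):
        # negative cosine similarity; z.detach() is the stop_gradient
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

    # symmetrized loss, averaged over the two directions
    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)
```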
Dynamic Analysis
However, I believe that translating the understanding of the SimSiam algorithm into an understanding of the alternating optimization of $\mathcal{L}(\theta, \eta)$ is merely a change of terminology and does not provide a substantive answer. This is because, clearly, $\mathcal{L}(\theta, \eta)$ also has a collapsed solution: the model could simply let all $\eta_{x}$ equal the same vector and then have $\mathcal{F}_{\theta}$ output the same constant vector. If we don't answer why the alternating optimization of $\mathcal{L}(\theta, \eta)$ doesn't collapse, we haven't answered the question.
Below, I will list what I consider to be the key factors behind SimSiam's non-collapse and demonstrate, through a simple example, that explaining why it does not collapse requires bringing in the dynamics. Of course, my account in this part is also incomplete and perhaps not even rigorous; it is simply offered as a new perspective, in the hope of drawing out better answers.
Deep Image Prior
First, people discovered long ago that a randomly initialized CNN can be used directly to extract visual features, and the results are not particularly bad. This observation can be traced back to the 2009 paper "What is the best multi-stage architecture for object recognition?". It can be understood as CNNs having an innate ability to process images. Later, this property was given the grand-sounding name "Deep Image Prior," from the paper "Deep Image Prior", which showed through experiments that, starting from a randomly initialized CNN and without any supervised training, tasks such as image inpainting and denoising can be accomplished, further confirming that CNNs naturally possess some ability to process images.
According to my understanding, the "Deep Image Prior" stems from three points:
1. Continuity of images: Images themselves can be directly viewed as continuous vectors, without needing to learn an Embedding layer as in NLP; this means we can already perform many tasks using simple methods like "original image + K-nearest neighbors";
2. CNN architectural prior: This refers to the local receptive field design of CNNs, which well simulates the visual processing of the human eye. Since the visual classification results we provide are based on our own vision, the two are consistent;
3. Good initialization: This is not hard to understand. Even the best model won't work with zero initialization. My previous article "Understanding Model Parameter Initialization Strategies from a Geometric Perspective" also briefly discussed initialization methods. From a geometric perspective, mainstream initialization methods are a type of approximate "orthogonal transformation," which can preserve the information of input features as much as possible.
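As a toy illustration of the claim at the start of this section, the sketch below uses a randomly initialized (untrained) ResNet directly as a feature extractor and retrieves nearest neighbors by cosine similarity. The choice of resnet18 and the retrieval setup are my own assumptions for illustration, not anything taken from the papers above.

```python
import torch
import torchvision

# A randomly initialized (untrained) ResNet used directly as a feature extractor.
backbone = torchvision.models.resnet18(weights=None)  # random init, no pretraining
backbone.fc = torch.nn.Identity()                     # drop the classification head
backbone.eval()

@torch.no_grad()
def extract_features(images):
    """images: float tensor of shape (N, 3, H, W), already normalized."""
    feats = backbone(images)                          # (N, 512)
    return torch.nn.functional.normalize(feats, dim=-1)

@torch.no_grad()
def nearest_neighbors(query, gallery, k=5):
    """Indices of the k most similar gallery images for each query image."""
    sims = extract_features(query) @ extract_features(gallery).T  # cosine similarities
    return sims.topk(k, dim=-1).indices
```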
The Dynamics of Non-Collapse
To reiterate: the Deep Image Prior means that a randomly initialized CNN is already a "not-so-bad" encoder from the start. What we need to do next can therefore be summarized in two points: learn in a better direction, and do not collapse toward a constant.
Learning in a better direction involves designing certain prior signals to better integrate prior knowledge into the model. SimSiam, BYOL, etc., use two different data augmentations for the same image and make their corresponding feature vectors as similar as possible. This is a good signal guidance, telling the model that simple transformations should not affect our visual understanding. In fact, this is one of the designs used by all contrastive learning methods.
The difference lies in "not collapsing toward a constant." General contrastive learning methods use negative samples to tell the model which images' features should not be similar, thereby preventing collapse. However, SimSiam and BYOL are different; they have no negative samples. In reality, they prevent collapse by decomposing the model's optimization process into two synchronous but differently paced modules. Taking SimSiam as an example, its optimization objective can be written as:
\begin{equation}\mathcal{L}(\varphi, \theta)=\mathbb{E}_{x, \mathcal{T}_1,\mathcal{T}_2}\Big[l\left(h_{\varphi}(f_{\theta}(\mathcal{T}_1(x))), f_{\theta}(\mathcal{T}_2(x))\right)\Big]\end{equation}
Then, using gradient descent for optimization, the corresponding dynamic equations are:
\begin{equation}\begin{aligned}
\frac{d\varphi}{dt} = - \frac{\partial\mathcal{L}}{\partial \varphi} =& -\mathbb{E}_{x, \mathcal{T}_1,\mathcal{T}_2}\bigg[\frac{\partial l}{\partial h}\frac{\partial h}{\partial \varphi}\bigg]\\
\frac{d\theta}{dt} = - \frac{\partial\mathcal{L}}{\partial \theta} =& -\mathbb{E}_{x, \mathcal{T}_1,\mathcal{T}_2}\bigg[\frac{\partial l}{\partial h}\frac{\partial h}{\partial f}\frac{\partial f}{\partial \theta} \color{skyblue}{\,+\underbrace{\frac{\partial l}{\partial f}\frac{\partial f}{\partial \theta}}_{\substack{\text{SimSiam}\\ \text{removed this}}}}\bigg]
\end{aligned}\end{equation}
The above formulas show the difference made by the presence or absence of the stop_gradient operator. Simply put, with stop_gradient added, $\frac{d\theta}{dt}$ loses its second term, and then $\frac{d\varphi}{dt}$ and $\frac{d\theta}{dt}$ both share the common factor $\frac{\partial l}{\partial h}$. Since $h_{\varphi}$ is closer to the output layer, and the initialized $f_{\theta}$ is already a reasonably good encoder, at the start of training $h_{\varphi}$ is optimized faster while the parts closer to the input are optimized more slowly. In other words, $\frac{d\varphi}{dt}$ is the fast component of the dynamics and $\frac{d\theta}{dt}$ is the slow one. Relatively speaking, $\frac{d\varphi}{dt}$ converges to 0 more quickly, which means $\frac{\partial l}{\partial h}$ becomes very small very fast; and since $\frac{d\theta}{dt}$ also contains the factor $\frac{\partial l}{\partial h}$, $\frac{d\theta}{dt}$ becomes small as well. Before the encoder has a chance to collapse, the force driving it toward collapse has already become negligible, so it does not collapse. Conversely, if the second term $\frac{\partial l}{\partial f}\frac{\partial f}{\partial \theta}$ is present (whether added back or kept on its own), it is equivalent to giving $\theta$ a "fast track" of its own, turning its dynamics into a fast component: even when $\frac{\partial l}{\partial h}=0$, the second term keeps driving the encoder toward collapse.
To give a simple specific example, consider:
\begin{equation}l = \frac{1}{2}(\varphi\theta - \theta)^2\end{equation}
For simplicity, let $\varphi, \theta$ both be scalars. The corresponding dynamic equations are:
\begin{equation}\frac{d\varphi}{dt}=-(\varphi\theta - \theta)\theta, \quad \frac{d\theta}{dt}=-(\varphi\theta - \theta) \varphi \color{skyblue}{+ \underbrace{(\varphi\theta - \theta)}_{\substack{\text{SimSiam}\\ \text{removed this}}}}\end{equation}
Suppose $\varphi(0)=0.6, \theta(0)=0.1$ (chosen arbitrarily). The evolution of both is:
[Figure: evolution of $\varphi$ and $\theta$ when the gradient of the second $\theta$ is stopped]
[Figure: evolution of $\varphi$ and $\theta$ when the gradient of the second $\theta$ is not stopped]
As can be seen, after stopping the gradient of the second $\theta$, the evolutions of $\varphi$ and $\theta$ are well coordinated: $\varphi$ quickly tends toward 1, while $\theta$ stabilizes at a non-zero value (i.e., no collapse). Conversely, if the second term of $\frac{d\theta}{dt}$ is added back, or even if only the second term is kept, the result is that $\theta$ quickly tends toward 0 while $\varphi$ fails to reach 1; that is, the fast dynamics are seized by $\theta$.
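For readers who want to reproduce the curves, here is a minimal numerical sketch of these toy dynamics using plain Euler integration; the step size and number of steps are arbitrary choices of mine, and `stop_gradient` toggles whether the term removed by SimSiam is included.

```python
def simulate(stop_gradient, phi0=0.6, theta0=0.1, dt=0.01, steps=20000):
    """Euler integration of the toy dynamics for l = (phi*theta - theta)^2 / 2."""
    phi, theta = phi0, theta0
    for _ in range(steps):
        r = phi * theta - theta          # the residual (phi*theta - theta)
        dphi = -r * theta                # d(phi)/dt
        dtheta = -r * phi                # gradient through the first theta only
        if not stop_gradient:
            dtheta += r                  # the extra term that stop_gradient removes
        phi += dt * dphi
        theta += dt * dtheta
    return phi, theta

print(simulate(stop_gradient=True))   # phi -> 1, theta settles at a non-zero value
print(simulate(stop_gradient=False))  # theta -> 0 (collapse), phi stops short of 1
```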
This example itself may not be very persuasive, but it simply reveals the changes in the dynamics:
The introduction of the predictor ($\varphi$) splits the model's dynamics into two major parts, and the inclusion of the stop_gradient operator makes the encoder part ($\theta$)'s dynamics slower and enhances the synchronization between the encoder and predictor. In this way, the predictor fits the target with "lightning speed," such that the optimization process stops before the encoder has time to collapse.
Another Look via Approximate Expansion
Of course, there are a thousand interpretations, and they are all "hindsight." The ones who are truly great are the discoverers; at most, we are just riding the coattails. Here, I'll ride a bit more by sharing another perspective on SimSiam. As mentioned at the beginning, the SimSiam paper proposed explaining SimSiam through the alternating optimization of the objective in $\eqref{eq:simsiam}$. This perspective starts from the objective in $\eqref{eq:simsiam}$ and further investigates the reason for its non-collapse.
If $\theta$ is fixed, for the objective $\eqref{eq:simsiam}$, it is easy to solve for the optimal value of $\eta_x$ as:
\begin{equation}\eta_x=\mathbb{E}_{\mathcal{T}}\left[\mathcal{F}_{\theta}(\mathcal{T}(x))\right]\end{equation}
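To spell out the "easy to solve" step: for fixed $\theta$ the loss decomposes over each sample $x$, and differentiating with respect to $\eta_x$ and setting the result to zero gives
\begin{equation}\frac{\partial}{\partial \eta_x}\mathbb{E}_{\mathcal{T}}\left[\left\|\mathcal{F}_{\theta}(\mathcal{T}(x))-\eta_{x}\right\|^2\right]=-2\,\mathbb{E}_{\mathcal{T}}\left[\mathcal{F}_{\theta}(\mathcal{T}(x))-\eta_{x}\right]=0\end{equation}
whose solution is exactly the expectation above.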
Substituting this back into $\eqref{eq:simsiam}$, we get the optimization objective as:
\begin{equation}\mathcal{L}(\theta)=\mathbb{E}_{x, \mathcal{T}}\bigg[\left\|\mathcal{F}_{\theta}(\mathcal{T}(x))-\mathbb{E}_{\mathcal{T}}\left[\mathcal{F}_{\theta}(\mathcal{T}(x))\right]\right\|^2\bigg]\end{equation}
If we assume that $\mathcal{T}(x)-x$ is a "small" vector, then the first-order expansion $\mathcal{F}_{\theta}(\mathcal{T}(x))\approx\mathcal{F}_{\theta}(x)+\frac{\partial \mathcal{F}_{\theta}(x)}{\partial x}\big(\mathcal{T}(x)-x\big)$ at $x$, applied to both terms, gives:
\begin{equation}\mathcal{L}(\theta)\approx\mathbb{E}_{x, \mathcal{T}}\bigg[\left\|\frac{\partial \mathcal{F}_{\theta}(x)}{\partial x}\big(\mathcal{T}(x)-\bar{x}\big)\right\|^2\bigg]\label{eq:em-sim}\end{equation}
where $\bar{x}=\mathbb{E}_{\mathcal{T}}\left[\mathcal{T}(x)\right]$ is the average result of the same image under all data augmentation methods. Note that it usually does not equal $x$. Similarly, for a version of SimSiam without stop_gradient and without a predictor, the loss function is approximately:
\begin{equation}\mathcal{L}(\theta)\approx\mathbb{E}_{x, \mathcal{T}_1, \mathcal{T}_2}\bigg[\left\|\frac{\partial \mathcal{F}_{\theta}(x)}{\partial x}\big(\mathcal{T}_2(x)-\mathcal{T}_1(x)\big)\right\|^2\bigg]\label{eq:em-sim-2}\end{equation}
In equation $\eqref{eq:em-sim}$, what is subtracted from each $\mathcal{T}(x)$ is $\bar{x}$, and it can be shown that this choice of center minimizes the loss. In equation $\eqref{eq:em-sim-2}$, what is subtracted from each $\mathcal{T}_2(x)$ is another random augmentation $\mathcal{T}_1(x)$, which makes both the loss itself and the variance of its estimate noticeably larger.
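Both claims can be checked under the additional assumption that the two augmentations are sampled independently. Writing $J=\frac{\partial \mathcal{F}_{\theta}(x)}{\partial x}$, for any fixed center $c$ we have
\begin{equation}\begin{aligned}
\mathbb{E}_{\mathcal{T}}\Big[\big\|J\big(\mathcal{T}(x)-c\big)\big\|^2\Big]&=\mathbb{E}_{\mathcal{T}}\Big[\big\|J\big(\mathcal{T}(x)-\bar{x}\big)\big\|^2\Big]+\big\|J\big(\bar{x}-c\big)\big\|^2\\
\mathbb{E}_{\mathcal{T}_1,\mathcal{T}_2}\Big[\big\|J\big(\mathcal{T}_2(x)-\mathcal{T}_1(x)\big)\big\|^2\Big]&=2\,\mathbb{E}_{\mathcal{T}}\Big[\big\|J\big(\mathcal{T}(x)-\bar{x}\big)\big\|^2\Big]
\end{aligned}\end{equation}
so $\bar{x}$ is the optimal fixed center, while pairing two independent augmentations doubles the expected loss (and, being a single-sample difference rather than an average, it also gives a noisier estimate).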
Does this mean that the scheme without stop_gradient and without a predictor fails simply because its loss and the variance of the loss estimate are too large? Noting that under the first-order approximation $\eta_x\approx \mathcal{F}_{\theta}(\bar{x})$, if the optimization objective is changed to:
\begin{equation}\mathcal{L}(\theta)=\mathbb{E}_{x, \mathcal{T}}\bigg[\left\|\mathcal{F}_{\theta}(\mathcal{T}(x))-\mathcal{F}_{\theta}(\bar{x})\right\|^2\bigg]\end{equation}
would it still not collapse? I have not verified this and do not know. Readers who are currently studying related content might want to verify it. This also leads to a related question: for an encoder trained this way, is it better to use $\mathcal{F}_{\theta}(x)$ or $\mathcal{F}_{\theta}(\bar{x})$ as the feature?
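If someone wanted to try this, a rough sketch might look like the following, with $\bar{x}$ approximated by averaging several augmented copies of each image. Everything here, from the number of views to the use of a plain squared error and the absence of any stop_gradient, is my own untested assumption, meant only to make the modified objective concrete.

```python
import torch

def mean_view_loss(f, x, augment, num_views=8):
    """Sketch of the modified objective: pull each augmented view's feature
    toward the feature of the averaged image x_bar.

    f: encoder, x: batch of images (N, C, H, W), augment: stochastic
    augmentation function. num_views controls the Monte-Carlo estimate of
    x_bar = E_T[T(x)]; whether to also stop gradients through f(x_bar) is
    itself part of the open question above.
    """
    views = [augment(x) for _ in range(num_views)]  # samples T_1(x), ..., T_k(x)
    x_bar = torch.stack(views).mean(dim=0)          # Monte-Carlo estimate of x_bar
    target = f(x_bar)                               # F_theta(x_bar), no stop_gradient

    per_view = [((f(v) - target) ** 2).sum(dim=-1).mean() for v in views]
    return sum(per_view) / num_views
```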
Of course, this part of the discussion is built on the assumption that "$\mathcal{T}(x)-x$ is a small vector." If it doesn't hold, then this section is in vain.
Summary
This article attempts to give my understanding, from a dynamics perspective, of why the BYOL and SimSiam algorithms do not collapse. Unfortunately, halfway through writing I realized that some of the analyses I had conceived could not be made self-consistent, so I deleted part of the content and added a new angle, trying to keep the article from fizzling out unfinished. As for its quality, that is another matter. I am sharing it here as a note, and I hope readers will be forgiving and offer corrections where it falls short.