By 苏剑林 | May 24, 2021
Everyone has likely been bombarded lately by various MLP-related works. Led by Google, several research institutions have been "pulling out all the stops," attempting to "strike" the Transformer model from multiple dimensions. Among these, the most aggressive are a series of models claiming to be "pure MLP," creating a sensation as if the era of "MLP is all you need" has arrived.
Behind this dazzling array of operations, is it a return to simplicity in the spirit of "the greatest truths are the simplest," or merely "reheating cold leftovers" after running out of fresh ideas? Let's follow this trend and review some of the recent related works.
Strange things happen every day, but May has been particularly unusual. Since the beginning of this month, major institutions seem to have made a pact, as various non-Transformer works have made their debut, like "a sudden spring breeze that brings thousands of pear trees into bloom." Just among the papers I have come across on arXiv, there are already as many as seven (and the month is not even over yet), all pointing in remarkably consistent directions and covering multiple tasks across NLP and CV. It is truly overwhelming:
《MLP-Mixer: An all-MLP Architecture for Vision》 - Google Research
《Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks》 - Tsinghua University
《Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet》 - University of Oxford
《Are Pre-trained Convolutions Better than Pre-trained Transformers?》 - Google Research
《ResMLP: Feedforward networks for image classification with data-efficient training》 - Facebook AI
《FNet: Mixing Tokens with Fourier Transforms》 - Google Research
《Pay Attention to MLPs》 - Google Research
The above papers are listed in chronological order of their appearance on arXiv. It is evident that Google remains the main force: it was Google that single-handedly started the "Attention is all you need" trend, and now it is Google again taking aim at the Transformer. Google's researchers truly never tire of digging new "pits" (opening up new directions).
So, what inspiration can this series of works offer? Should we rush to follow this line of work? In what follows, we will briefly go through the aforementioned papers to see what they are about and whether they might ignite a new wave of models.
To interpret the aforementioned MLP-related works, one must mention Synthesizer, published by Google in May last year: 《Synthesizer: Rethinking Self-Attention in Transformer Models》. In fact, if you are already familiar with Synthesizer, several of the papers in the list above can simply be skimmed.
In a previous blog post 《Google's New Work Synthesizer: We Still Don't Understand Self-Attention Well Enough》, we provided a simple interpretation of Synthesizer. Setting aside the scaling factor, the Attention operation can be decomposed as: \begin{equation}\boldsymbol{O}=\boldsymbol{A}\boldsymbol{V},\quad \boldsymbol{A}=\text{softmax}(\boldsymbol{B}),\quad \boldsymbol{B}=\boldsymbol{Q}\boldsymbol{K}^{\top}\end{equation} where $\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}$ are transformations of the input sequence. Readers familiar with Self-Attention should find this clear. Synthesizer experimented with several new ways of constructing $\boldsymbol{B}$, the most impressive of which is called "Random," where the entire $\boldsymbol{B}$ is treated as a parameter matrix (either updated after random initialization or not updated at all).
In the Random case, the Attention matrix no longer varies with the sample; that is, all samples share the same Attention matrix. Yet it still achieves good results, which at the time came as quite a shock to the prevailing understanding of Attention. Synthesizer's experiments were also quite extensive, covering machine translation, automatic summarization, dialogue generation, and "pre-training + fine-tuning." It is fair to say that many of the later papers listed above do not have experiments as rich as Synthesizer's.
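To make the Random mode concrete, here is a minimal PyTorch-style sketch of what such a layer might look like (my own illustration, not the paper's code; names such as `max_len` are assumptions):

```python
import torch
import torch.nn as nn

class RandomSynthesizerAttention(nn.Module):
    """Synthesizer 'Random' mode sketch: the pre-softmax matrix B is a plain
    trainable parameter shared by all samples, instead of Q K^T."""
    def __init__(self, max_len, d_model, trainable=True):
        super().__init__()
        # B no longer depends on the input; it is just an n x n parameter.
        self.B = nn.Parameter(torch.randn(max_len, max_len) * 0.02,
                              requires_grad=trainable)
        self.value = nn.Linear(d_model, d_model)  # V is still a transform of the input

    def forward(self, x):                    # x: (batch, n, d_model), n <= max_len
        n = x.size(1)
        A = torch.softmax(self.B[:n, :n], dim=-1)   # A = softmax(B)
        return A @ self.value(x)                    # O = A V
```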
Synthesizer probably didn't expect that a year later, it would change its name and become famous.
The MLP-Mixer proposed in the paper 《MLP-Mixer: An all-MLP Architecture for Vision》 is actually the Random mode of Synthesizer with the softmax activation removed. In other words, it sets $\boldsymbol{B}$ as a trainable parameter matrix and directly lets $\boldsymbol{A}=\boldsymbol{B}$. That is the entire model. The only other difference is that MLP-Mixer is applied to CV tasks while Synthesizer was applied to NLP tasks.
By the way, why is the model called MLP-Mixer? Because the authors named this direct trainable Attention mode "token-mixing MLP" and renamed the original FFN to "channel-mixing MLP" (formerly called Position-wise FC). Whatever it's called, it claims to be just MLP, hence the name MLP-Mixer.
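For reference, a rough sketch of one Mixer block under this naming (hidden sizes and identifiers are my assumptions; the real model also includes patch embedding and other details):

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """MLP-Mixer block sketch: 'token-mixing' acts across the sequence axis
    (the de-softmaxed Synthesizer-Random attention), while 'channel-mixing'
    is simply the familiar position-wise FFN."""
    def __init__(self, num_tokens, d_model, token_hidden=256, channel_hidden=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.token_mlp = nn.Sequential(            # mixes information between tokens
            nn.Linear(num_tokens, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(d_model)
        self.channel_mlp = nn.Sequential(          # the usual FFN over features
            nn.Linear(d_model, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, d_model))

    def forward(self, x):                          # x: (batch, num_tokens, d_model)
        y = self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + y
        return x + self.channel_mlp(self.norm2(x))
```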
In fact, I believe a more standard name for this is a 1D convolution with a window size of 1. However, both this paper and the original 《Attention Is All You Need》 prefer to invent new names for conventional operations, selectively reducing or even ignoring the connection with convolution—it seems a great deal of effort was spent for the sake of "A Good Title Is All You Need."
Actually, this point was also criticized by Yann LeCun: if it were truly a standard MLP, the input should be flattened into a one-dimensional vector and then fed through a full transformation matrix.
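The kernel-size-1 claim is easy to check numerically: a `Conv1d` with kernel size 1 and a position-wise `Linear` carrying the same weights produce identical outputs (a small sanity check of my own, not taken from either paper):

```python
import torch
import torch.nn as nn

d_in, d_out, seq_len = 8, 16, 10
x = torch.randn(2, seq_len, d_in)                 # (batch, seq_len, features)

linear = nn.Linear(d_in, d_out)
conv = nn.Conv1d(d_in, d_out, kernel_size=1)
# Copy the Linear weights into the Conv1d kernel: (d_out, d_in) -> (d_out, d_in, 1)
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))
    conv.bias.copy_(linear.bias)

y_linear = linear(x)                               # position-wise fully connected
y_conv = conv(x.transpose(1, 2)).transpose(1, 2)   # Conv1d expects (batch, channels, seq)
print(torch.allclose(y_linear, y_conv, atol=1e-6)) # True
```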
From an analogical perspective, Synthesizer's Random mode or MLP-Mixer is equivalent to setting both $\boldsymbol{Q}$ and $\boldsymbol{K}$ in Attention as parameter matrices. The "External Attention" proposed in 《Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks》 sets $\boldsymbol{K}$ and $\boldsymbol{V}$ as parameter matrices (of fixed size). The experimental tasks are also for CV.
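In code, the core of External Attention might look roughly like this (a simplified single-head sketch of my own; the paper's double-normalization and multi-head details are omitted, and the names are assumptions):

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """External Attention sketch: K and V are fixed-size learned 'memory'
    matrices rather than transforms of the input."""
    def __init__(self, d_model, num_mem=64):
        super().__init__()
        self.mem_k = nn.Linear(d_model, num_mem, bias=False)  # plays the role of K (num_mem x d)
        self.mem_v = nn.Linear(num_mem, d_model, bias=False)  # plays the role of V (num_mem x d)

    def forward(self, x):                            # x: (batch, n, d_model)
        attn = torch.softmax(self.mem_k(x), dim=-1)  # each token attends to the memory slots
        return self.mem_v(attn)                      # (batch, n, d_model)
```

Note that each token's output depends only on that token and the shared memory matrices; there is no token-to-token interaction.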
This wouldn't be an issue in itself, as deep learning is performance-driven; if the results are good, the work is valid. Personally, however, I find many of the claims in the External Attention paper hard to support.
First, it calls itself "two linear layers," deliberately downplaying its connection to Attention (is it embarrassing to say it's a special case of Attention?). Then it says that "by introducing two external memory units (the $\boldsymbol{K}$ and $\boldsymbol{V}$ set as parameters), it implicitly learns the features of the entire dataset." This statement isn't exactly wrong, but any parameter of any model can be explained this way; it is not a trait specific to External Attention. Additionally, it claims to achieve linear complexity, but that requires fixing the size of $\boldsymbol{K}$ and $\boldsymbol{V}$. In that case, comparing it with Linformer, which also has linear complexity, would be more persuasive (the paper compares with Performer, but Performer reduces complexity by a different route; Linformer is the more relevant baseline).
Setting aside the naming, the mechanism of External Attention itself seems a bit puzzling. It is not hard to see that in External Attention each feature (token) is encoded in isolation. Translated to NLP, this would mean every word is encoded independently, with no interaction with its context, which surely wouldn't work. Why, then, does it work in CV?
As for the paper 《Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet》, it overlaps heavily with MLP-Mixer, but it is written far more plainly: it simply passes the input through a conventional FFN, transposes the result, passes it through another FFN, and finally transposes it back. If you are familiar with Transformers, you immediately see what it does.
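The whole trick fits in a few lines (my paraphrase, assuming PyTorch-style shapes; `ffn_tokens` and `ffn_channels` stand for ordinary two-layer MLPs):

```python
def stack_of_ffn_block(x, ffn_tokens, ffn_channels):
    """x: (batch, num_tokens, d_model). ffn_tokens is an FFN applied along the
    token axis (via the transpose trick), ffn_channels the usual FFN over features."""
    x = x + ffn_tokens(x.transpose(1, 2)).transpose(1, 2)  # FFN, transposed in and out
    x = x + ffn_channels(x)                                # ordinary position-wise FFN
    return x
```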
The paper itself is very short, totaling only 4 pages, including 1 page of code and half a page of references. The actual text is only 2.5 pages, making it more like a brief report. Perhaps the authors intended to dig deeper into this area, but Google's MLP-Mixer came out first, so there was no point in continuing, and they hurried to publish. (This narrative is purely my own speculation.)
In fact, CNNs were the first models to attempt replacing RNNs (in Seq2Seq tasks). Facebook's 《Convolutional Sequence to Sequence Learning》 was published earlier, but it was quickly overshadowed by Google's 《Attention Is All You Need》. After the release of models like GPT and BERT, Transformer-based models became the mainstream, and CNNs were rarely researched deeply.
The paper 《Are Pre-trained Convolutions Better than Pre-trained Transformers?》 helps us verify the effectiveness of "CNN + pre-training." The results show that whether trained with supervision on downstream data alone or fine-tuned after pre-training, CNN models based on dilated or dynamic convolutions slightly outperform Transformer models while also being faster. By the way, this paper has already been accepted to ACL 2021, so it was actually completed earlier and only released this month.
The main inspiration this paper gives us is that "pre-training improvements" and "model improvements" should not be conflated. Pre-training technology itself often brings improvements to many kinds of models. We shouldn't think only of Transformers whenever pre-training is mentioned, nor should we combine pre-training only with Transformers. In fact, I used to prefer CNNs and achieved good results on multiple tasks through the design of "Dilated Gated Convolution" (DGCNN), and this paper once again confirms the value of CNNs. Even so, I likely won't devote my primary energy to CNN research.
First, theoretically, CNNs cannot capture sufficiently long-range dependencies, which is a fundamental flaw. Although dilated convolutions can rapidly enlarge the receptive field, it is still only "relatively large," unlike a Transformer, which can connect any two positions in a single step. Second, from the perspective of simply improving efficiency, Transformers themselves still have plenty of room for optimization, so switching to CNNs solely for execution speed is not very convincing. Moreover, the $\mathcal{O}(n^2)$ Attention of Transformers leaves more room for "tinkering" (like UniLM) and allows for more variety of play (like K-BERT).
In summary, we cannot deny the value of CNNs, but if one is already quite focused on Transformers, there is no need to divert too much energy toward CNNs.
As for ResMLP proposed by Facebook in 《ResMLP: Feedforward networks for image classification with data-efficient training》, it has no essential difference from the aforementioned MLP-Mixer and Stack of FFN. Its description is also very similar to Stack of FFN. Ignoring minor details, one could even consider the three to be the same model. Finally, ResMLP's experimental tasks were also in CV.
In my view, 《FNet: Mixing Tokens with Fourier Transforms》 is the most interesting paper on the list. In a sense, FNet is also a special case of MLP-Mixer, but a very intriguing one: in MLP-Mixer the mixing matrix is a directly optimized parameter, whereas in FNet the corresponding matrix is simply fixed to be the Fourier transform matrix. As a result, the "Attention layer" in FNet has no trainable parameters at all!
We can also understand FNet from the perspective of Attention. Setting aside the normalization factor, the Attention operation can be roughly written as: \begin{equation}\boldsymbol{O}=\boldsymbol{A}\boldsymbol{V},\quad \boldsymbol{A}=\exp(\boldsymbol{B}),\quad \boldsymbol{B}=\boldsymbol{Q}\boldsymbol{K}^{\top}\end{equation} Here, $\boldsymbol{Q}, \boldsymbol{K}$ were originally $n\times d$ matrices. FNet suggests that $\boldsymbol{Q}, \boldsymbol{K}$ can be replaced by $n\times 1$ matrices: \begin{equation}\boldsymbol{Q}=\boldsymbol{K}=\begin{pmatrix}0 \\ 1 \\ 2 \\ \vdots \\ n - 1\end{pmatrix}\end{equation} Yes, you read that right; it crudely replaces them with an $n\times 1$ matrix consisting of $0 \sim n-1$. Of course, this would cause $\exp(\boldsymbol{B})$ to explode exponentially. To avoid this, FNet changes it to: \begin{equation}\boldsymbol{A}=\exp(\text{i}\boldsymbol{B})\end{equation} By using a complex exponential, it won't explode! It's that straightforward, and that's how we get the Fourier Transform-based FNet. The original paper applied Fourier Transforms along both the sequence length and feature dimension directions and then kept only the real part, using this operation to replace Self-Attention. For the implementation of the Fourier Transform, we have the "Fast Fourier Transform (FFT)" algorithm with an efficiency of $\mathcal{O}(n\log n)$, so FNet can effectively handle long sequences.
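In code, the parameter-free mixing step can be sketched as follows (a minimal illustration assuming PyTorch's `torch.fft`; the full FNet block additionally has residual connections, LayerNorm, and an FFN):

```python
import torch

def fnet_token_mixing(x):
    """FNet mixing sketch: apply a 2D discrete Fourier transform over the
    sequence and hidden dimensions and keep only the real part.
    x: (batch, seq_len, d_model), real-valued. There are no trainable
    parameters, and the FFT costs O(n log n) in the sequence length."""
    return torch.fft.fft2(x, dim=(-2, -1)).real
```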
From the results of pre-training and downstream tasks, FNet does not have much of an advantage. However, its performance on the Long-Range Arena (a benchmark testing long-range capabilities of models) is quite good.
Of course, the fact that such a crude approach as FNet's works at all is something of a miracle. The biggest shock it brings is, undoubtedly: even this works? Why does the Fourier Transform work? I don't know the answer either. Some online comments say this indicates that the Attention mechanism is really a kind of change of coordinate basis, and since the Fourier Transform is also a basis transformation, the effects are similar. This explanation feels like it touches on something essential. There is also an ICLR 2021 paper, 《Is Attention Better Than Matrix Decomposition?》, which achieved good results using SVD to replace Attention, suggesting that the basis-transformation view has merit (SVD is also a basis transformation). But how to preserve sequence order under such a transformation, and which basis transformation is most suitable, are questions for which we have no clear clues yet.
Finally, the gMLP and aMLP introduced in 《Pay Attention to MLPs》 are relatively conventional structural explorations, which can be viewed as enhanced versions of MLP-Mixer. The "g" in gMLP stands for "gate": simply put, gMLP combines MLP-Mixer with a gating mechanism. The "a" in aMLP stands for "attention": aMLP further combines Attention with gMLP.
Specifically, gMLP operates roughly as follows: \begin{equation}\begin{aligned} &[\boldsymbol{X}_1, \boldsymbol{X}_2] = \boldsymbol{X} \\ &\boldsymbol{Y} = \boldsymbol{W}\boldsymbol{X}_2 + \boldsymbol{b} \\ &\boldsymbol{O} = \boldsymbol{X}_1 \otimes \boldsymbol{Y} \end{aligned}\end{equation} In simple terms, the input is split in half along the feature dimension, and one half is passed through the MLP-Mixer-style token mixing to serve as the gate for the other half. aMLP additionally adds a simple single-head Self-Attention to the gate: \begin{equation}\begin{aligned} &[\boldsymbol{X}_1, \boldsymbol{X}_2] = \boldsymbol{X} \\ &\boldsymbol{Y}_1 = \boldsymbol{W}\boldsymbol{X}_2 + \boldsymbol{b} \\ &\boldsymbol{Y}_2 = \text{SelfAttention}(\boldsymbol{X}) \\ &\boldsymbol{O} = \boldsymbol{X}_1 \otimes (\boldsymbol{Y}_1 + \boldsymbol{Y}_2) \end{aligned}\end{equation}
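A sketch of gMLP's gating unit might look like this (the LayerNorm and near-identity initialization follow my reading of the paper; the class and variable names are my assumptions):

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """gMLP gating sketch: split the channels in half, token-mix one half with
    a linear map along the sequence axis, and use it to gate the other half."""
    def __init__(self, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        self.spatial_proj = nn.Linear(seq_len, seq_len)  # W acts across tokens
        nn.init.zeros_(self.spatial_proj.weight)          # near-identity gate at init
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, x):                   # x: (batch, seq_len, d_ffn)
        x1, x2 = x.chunk(2, dim=-1)         # [X1, X2] = X, split along features
        y = self.norm(x2).transpose(1, 2)   # token-mixing: act along seq_len
        y = self.spatial_proj(y).transpose(1, 2)
        return x1 * y                       # O = X1 ⊗ Y
```

For aMLP, the output of a single-head Self-Attention over the input would simply be added to `y` before the element-wise gating, matching $\boldsymbol{Y}_1 + \boldsymbol{Y}_2$ above.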
The paper conducts fairly comprehensive experiments covering both CV and NLP. From the reported results, gMLP slightly underperforms standard Self-Attention, while aMLP generally outperforms it, further confirming the value of gating mechanisms. However, both gMLP and aMLP feel heavily hand-crafted. They are fine for producing a paper, but in my opinion they bring no new inspiration to the development of model architectures.
Through the above reading, we can see that MLP-Mixer, Stack of FFN, and ResMLP are effectively special cases of last year's Synthesizer. Technically speaking, they are not even as rich in content as Synthesizer. Therefore, they don't really count as interesting work. As for the improved versions gMLP / aMLP, they are very conventional structural "alchemy" works—anyone with enough computing power could do them. Thus, they aren't very interesting either. External Attention claims to be "two linear layers," but it is actually a variant of Attention, and its effectiveness and experimental comparisons are not yet clear. The most interesting works are CNN pre-training and FNet: one decouples the concepts of "pre-training improvement" and "model improvement," and the Fourier Transform proposed by the other provides a significant conceptual shock.
Overall, these works are far from mature; at most they have offered a preliminary verification of effectiveness, and they can hardly be called elegant. For example, apart from FNet, these so-called "pure MLP" models cannot gracefully handle variable-length inputs. MLP-Mixer, Stack of FFN, and ResMLP were tested only on (fixed-size) images, so they never had to face this issue; Synthesizer / gMLP / aMLP did run NLP experiments, but they appear to rely on forced truncation, which is not elegant. So while this series of works has opened up some new ideas, it has in fact raised even more questions that remain to be answered.
So, should we follow them? Personally, I don't think it's necessary to invest too much energy; just paying general attention is fine. Setting aside the issue of elegance, the practicality of these works is questionable. The biggest advantage of replacing Attention with MLP is speed; yes, it is a bit faster, but the theoretical complexity is still $\mathcal{O}(n^2)$, meaning there is no essential improvement. Moreover, the speed increase usually comes at the cost of a slight performance hit. If the goal is just "speeding up with a slight reduction in performance," Transformers have plenty of work that can be done (the most direct being removing a layer or two). There's no need to switch to MLP, and switching to MLP reduces the degrees of freedom for exploration. Of course, from the academic perspective of "pioneering," it is meaningful to try various new models from different angles. However, it is not advisable to inject too many artificial factors into them, otherwise, it becomes a process of over-fitting the structure to the task, and such works would hardly reach the highest academic standards.
Furthermore, for NLP, we mostly care about "pre-training + fine-tuning" performance. Unfortunately, the series of NLP experiments starting from Synthesizer shows that while an MLP-based model might achieve competitive results on a specific task, its transferability is often poor. That is, even if the pre-training metrics look decent, "pre-training + fine-tuning" usually falls behind Transformers. This is not hard to understand: once the Attention matrix is parameterized directly, it tends to become strongly tied to the specific task, unlike the Attention matrix that a Transformer generates adaptively from its input, which transfers more readily.
This article has reviewed some recent "non-mainstream" works, which mainly replace the Transformer with non-Transformer structures, mostly based on MLPs, while achieving competitive results. Overall, these works appear diverse but follow recognizable patterns, giving a sense of "old wine in new bottles"; not many of them offer truly new insights.
This entire article consists only of my personal reflections, representing my own viewpoints. Should there be any improprieties, I hope readers will kindly offer their corrections.