SVD Decomposition (III): Even Word2Vec is Just an SVD?

By 苏剑林 | February 23, 2017

This article brings some "heavyweight" news, as the title suggests: even the famous deep learning word vector tool, Word2Vec, is essentially just an SVD!

Of course, super loyal fans of Word2Vec need not get too excited. We are only saying that the model structures are equivalent, not that they are identical in every way. Word2Vec still has its unique aspects. However, once explained in this way, many problems can likely be understood through similar logic.

Word Vector = One Hot

Let's first review an article from last year, "What is actually going on with Word Vectors and Embedding?". The main point of that article was: the so-called Embedding layer is simply a fully connected layer with a one-hot input (again, emphasizing that this is "identical," not merely "equivalent"), and word vectors are the parameters of this fully connected layer. As for Word2Vec, it trains this Embedding layer through a greatly simplified language model to obtain word vectors (it uses many optimization tricks, but its model structure really is that simple). The reason word vectors can reduce the risk of overfitting is that tools like Word2Vec pre-train this Embedding layer on large-scale corpora in an unsupervised manner; it has nothing to do with whether you use one-hot encoding, an Embedding layer, or the word vectors themselves.

With this perspective, we can immediately explain why a method we used previously works. In sentiment classification problems, if you have word vectors and want to obtain a sentence vector, the simplest scheme is to directly sum or average the word vectors of the words in the sentence. This alone can achieve about 85% accuracy. In fact, this is also the approach used by Facebook's text classification tool, FastText (FastText also introduces n-gram features to alleviate the word-order problem, but broadly speaking it still averages feature vectors to get sentence vectors). Why does such a seemingly crude, unconvincing scheme achieve such decent accuracy?
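As an illustration only (this is not FastText's or the author's actual code), here is a minimal sketch of this sentence-vector scheme, assuming you already have a dictionary `word_vectors` of pre-trained vectors and a tokenized sentence; the classifier on top is omitted:

```python
import numpy as np

def sentence_vector(words, word_vectors, dim=128):
    """Average the pre-trained vectors of the words in a sentence."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:                      # none of the words are in the vocabulary
        return np.zeros(dim)
    return np.mean(vecs, axis=0)      # summing instead of averaging also works

# Hypothetical usage: word_vectors = {"我": np.random.randn(128), ...}
# sent_vec = sentence_vector(["我", "爱", "空间", "科学"], word_vectors)
```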

Returning to the era of one-hot encoding, this is how we represented a sentence. Suppose we use the following one-hot encoding for the six words "我" (I), "爱" (Love), "科学" (Science), "空间" (Space), "不" (Not), and "错" (Bad):

\[\begin{array}{c|c}\hline\text{我 (I)} & [1, 0, 0, 0, 0, 0]\\ \text{爱 (Love)} & [0, 1, 0, 0, 0, 0]\\ \text{科学 (Science)} & [0, 0, 1, 0, 0, 0]\\ \text{空间 (Space)} & [0, 0, 0, 1, 0, 0]\\ \text{不 (Not)} & [0, 0, 0, 0, 1, 0]\\ \text{错 (Bad)} & [0, 0, 0, 0, 0, 1]\\ \hline \end{array}\]

Then, disregarding word order, the phrase "我爱空间科学" (I love space science) can be represented by the following vector (Bag of Words):

\[\begin{pmatrix}1 & 1 & 1 & 1 & 0 & 0\end{pmatrix}\]

With this vector, to use a neural network for classification, we can follow it with a fully connected layer with 3 hidden nodes:

\[\begin{aligned}&\begin{pmatrix}1 & 1 & 1 & 1 & 0 & 0\end{pmatrix}\begin{pmatrix}w_{11} & w_{12} & w_{13}\\ w_{21} & w_{22} & w_{23}\\ w_{31} & w_{32} & w_{33}\\ w_{41} & w_{42} & w_{43}\\ w_{51} & w_{52} & w_{53}\\ w_{61} & w_{62} & w_{63}\end{pmatrix}\\ =&\begin{pmatrix}w_{11}+w_{21}+w_{31}+w_{41} & w_{12}+w_{22}+w_{32}+w_{42} & w_{13}+w_{23}+w_{33}+w_{43}\end{pmatrix}\end{aligned}\]

Wait a moment. Isn't this just taking the first four rows of the weight matrix, i.e., the first four word vectors, and adding them together?

The situation is now clear. If we use a traditional Bag of Words model, ignore word order, and then append a fully connected layer—and to prevent overfitting, we replace the parameters of this fully connected layer with pre-trained word vectors—then the result is equivalent to directly extracting the corresponding word vectors and summing them! In other words, summing word vectors to get a sentence vector is actually the equivalent of the traditional Bag of Words model!
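A quick numerical check of this equivalence (a toy sketch with random weights, purely for illustration):

```python
import numpy as np

np.random.seed(0)
W = np.random.randn(6, 3)                        # word-vector table for the six words above
bow = np.array([1, 1, 1, 1, 0, 0])               # Bag of Words vector for "我爱空间科学"

# Multiplying the BoW vector by the fully connected layer's weights...
out = bow @ W
# ...gives exactly the sum of the first four word vectors.
assert np.allclose(out, W[:4].sum(axis=0))
```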

Word2Vec = SVD?

Throughout this article, whenever I mention the equivalence between Word2Vec and SVD, I add a question mark. This is because their model structures are equivalent, but their implementations differ. Whether this counts as "equivalence" depends on how the reader defines the word.

In fact, the concept of word vectors has been around for a long time. At that time, it wasn't called Word Embedding; it was called "distributed representation." The original meaning was that the context of a word helps us understand the word itself. Now assume the total vocabulary has $N$ words. Using one-hot encoding means representing each word as an $N$-dimensional vector. A distributed representation, on the other hand, involves imagining a window (several words before and after plus the current word) and then counting the distribution of the surrounding words. This distribution is used to represent the current word, and this distribution can also be represented as a corresponding $N$-dimensional vector. Because a word is represented through its contextual distribution rather than isolated "one-hot" encoding, it can express semantic correlations. However, the problem is that it is still $N$-dimensional; the dimension is too large, and the entire word vector table (co-occurrence matrix) is too sparse.
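A minimal sketch of building such a distributed representation (the toy corpus and window size here are my own illustrative choices, not from the original article):

```python
import numpy as np

corpus = [["我", "爱", "空间", "科学"], ["科学", "空间", "不", "错"]]   # toy corpus
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
N, window = len(vocab), 2

cooc = np.zeros((N, N))                           # co-occurrence matrix
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[idx[w], idx[sent[j]]] += 1   # count words inside the window

# Row idx[w] of `cooc` is the N-dimensional distributed representation of word w.
```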

What should we do? Actually, mathematicians already had a solution. For sparse matrices, a scheme that can simultaneously achieve dimensionality reduction and improve generalization ability is to perform an SVD decomposition on the matrix. As the first article in this series stated, SVD decomposition is equivalent to a three-layer autoencoder. Looking at it today, this scheme suggests: the original distributed word vectors are $N$-dimensional, which is too large; we can use an autoencoder to reduce the dimensionality. Denote the number of hidden nodes in the autoencoder as $n$. By setting $n$ to an appropriate value and training the autoencoder, the $n$-dimensional results of the hidden layer can be used directly as the new word vectors. Therefore, this is an autoencoder scheme with $N$-dimensional input, $n$ middle nodes, and $N$-dimensional output, which is also equivalent to an SVD decomposition.
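The dimensionality reduction step can then be sketched as a truncated SVD of the co-occurrence matrix (a random stand-in matrix is used below, and the choice of $n$ and of keeping $U_{[:, :n]}\Sigma_{[:n]}$ as the word vectors is one common convention, assumed here for illustration):

```python
import numpy as np

np.random.seed(0)
N, n = 1000, 50
cooc = np.random.rand(N, N)          # stand-in for a real co-occurrence matrix

U, S, Vt = np.linalg.svd(cooc, full_matrices=False)
word_vectors = U[:, :n] * S[:n]      # keep only the top-n singular directions

print(word_vectors.shape)            # (1000, 50): one n-dimensional vector per word
```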

So, what about Word2Vec? How should we view the Word2Vec model? One of Word2Vec's schemes, CBOW, sums the word vectors of several surrounding words, feeds the sum into a fully connected layer with $N$ outputs, and applies a softmax to predict the probability of the current word. As stated in the first half of this article, this summation of word vectors is equivalent to appending a fully connected layer (whose parameters are the word-vector table) to the original Bag of Words model. Viewed this way, Word2Vec is also just a three-layer neural network with $N$-dimensional input, $n$ middle nodes, and $N$-dimensional output. From the network-structure perspective, it is equivalent to an autoencoder, which in turn is equivalent to an SVD decomposition.
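To make the structural comparison concrete, here is a minimal sketch of this CBOW structure in Keras (my own illustrative code, not Word2Vec's actual implementation, which replaces the full softmax with hierarchical softmax or negative sampling; the vocabulary size, vector dimension, and window size below are assumptions):

```python
import keras.backend as K
from keras.layers import Input, Embedding, Lambda, Dense
from keras.models import Model

N, n, window = 10000, 128, 4                       # vocabulary size, vector dim, context words

context = Input(shape=(window,), dtype='int32')         # indices of the surrounding words
vectors = Embedding(N, n)(context)                      # look up their word vectors: (window, n)
summed = Lambda(lambda x: K.sum(x, axis=1))(vectors)    # sum of the context word vectors
prob = Dense(N, activation='softmax')(summed)           # fully connected layer + softmax over vocab

cbow = Model(context, prob)
cbow.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# After training, the Embedding layer's weight matrix is the word-vector table.
```

Structurally, this is exactly the $N$-input, $n$-middle, $N$-output network described above.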

From an implementation standpoint, the differences are also obvious:

1. This Word2Vec scheme can be seen as using surrounding words to predict the current word, whereas an autoencoder or SVD uses surrounding words to predict the surrounding words themselves;

2. Word2Vec finally connects to a softmax to predict probabilities, meaning it implements a non-linear transformation, while an autoencoder or SVD does not.

How much these two differences affect the quality of the word vectors is not something I will prove rigorously here. In practical tests on the same corpus, however, Word2Vec's word vectors do seem to be of higher quality.

A Summary That Is Not a Summary

This article has been a series of brainstorms about the connection between traditional distributed representations and the Word2Vec model, ultimately arriving at the results above. The main purpose of this kind of thinking is to connect different things together so that our understanding moves closer to the essence, which in turn can provide useful guidance for applying or even constructing models. Overall, this series of articles tells us that by writing many operations or models in matrix form, we can see the essence of many things, which helps guide our thinking about directions for improvement. For example, many models can actually be written as matrix multiplications, and a matrix multiplication is equivalent to one layer of a neural network. Since it is a neural network, can we add more layers? Can we add activation functions? Can we decompose it with SVD? After all, according to the second article in this series, SVD has a clear clustering interpretation. Reflections like these can help us build or even generalize models.