What Exactly Are Word Vectors and Embedding?

By 苏剑林 | December 03, 2016

Word vectors are known in English as Word Embeddings, which would literally translate back as "word embedding." When word vectors come up, many readers immediately think of Google's Word2Vec; the brand effect is indeed powerful. In addition, frameworks such as Keras have an Embedding layer, which is also said to map word IDs to vectors. Because of these preconceptions, people tend to equate word vectors with Word2Vec, and conversely ask questions like "What kind of word vector is Embedding?" This can be very confusing for beginners; in fact, even veterans do not necessarily find it easy to explain clearly. All of this has to start from one-hot...

The Pot Calling the Kettle Black: one hot

One-hot is the most primitive way to represent characters or words. For simplicity, this article uses characters as an example; words are similar. Suppose there are six characters in the vocabulary: "科 (Sci), 学 (ence), 空 (S), 间 (pace), 不 (Not), 错 (Bad)". One-hot assigns a 0-1 encoding to each of these six characters:

$$\begin{array}{c|c}\hline\text{科} & [1, 0, 0, 0, 0, 0]\\ \text{学} & [0, 1, 0, 0, 0, 0]\\ \text{空} & [0, 0, 1, 0, 0, 0]\\ \text{间} & [0, 0, 0, 1, 0, 0]\\ \text{不} & [0, 0, 0, 0, 1, 0]\\ \text{错} & [0, 0, 0, 0, 0, 1]\\ \hline \end{array}$$

Now, if we want to represent the word "科学" (Science), we can use the matrix:

$$\begin{pmatrix}1 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & 0 \end{pmatrix}$$
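As a quick illustration, here is a minimal NumPy sketch of the same idea (the vocabulary and its ordering are just the six characters above; everything else is illustrative):

```python
import numpy as np

# Vocabulary of six characters, in the same order as the table above
vocab = ["科", "学", "空", "间", "不", "错"]
char_to_id = {ch: i for i, ch in enumerate(vocab)}

# The full one-hot table is just the 6x6 identity matrix
one_hot_table = np.eye(len(vocab), dtype=int)

# "科学" becomes a 2x6 matrix: one one-hot row per character
word = "科学"
word_matrix = one_hot_table[[char_to_id[ch] for ch in word]]
print(word_matrix)
# [[1 0 0 0 0 0]
#  [0 1 0 0 0 0]]
```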

You can probably already sense the problem: the vector needs as many dimensions as there are characters. With 10,000 characters, each character vector is 10,000-dimensional (commonly used characters may number only a few thousand, but at the word level there can be hundreds of thousands of entries). To solve this, continuous vector representations emerged, such as representing a character with a 100-dimensional real-valued vector, which greatly reduces the dimensionality, lowers the risk of overfitting, and so on. Beginners say this, and many experts say it too.

However, the truth is: Total nonsense! Nonsense! Nonsense! Important things must be said three times.

Let me pose a problem to make things clear: given two arbitrary 100×100 real matrices and asked to compute their product by hand, few people could manage it. But given two 1000×1000 matrices where one of them is a one-hot matrix (each row has a single 1 and all other entries 0), you can multiply them very quickly. If you don't believe it, try it.

Do you see the issue? A one-hot matrix may be huge, but multiplying by it is easy. Your "fancy" low-dimensional real matrix is actually more troublesome to compute with (though this cost is negligible for a computer)! Of course, the deeper reason comes below.

Paradoxically True: Let's Actually Calculate It Once

$$\begin{pmatrix}1 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & 0 \end{pmatrix}\begin{pmatrix}w_{11} & w_{12} & w_{13}\\ w_{21} & w_{22} & w_{23}\\ w_{31} & w_{32} & w_{33}\\ w_{41} & w_{42} & w_{43}\\ w_{51} & w_{52} & w_{53}\\ w_{61} & w_{62} & w_{63}\end{pmatrix}=\begin{pmatrix}w_{11} & w_{12} & w_{13}\\ w_{21} & w_{22} & w_{23}\end{pmatrix}$$

The left-hand side is exactly what a fully connected neural network layer does with the 2×6 one-hot matrix as input and 3 hidden nodes. Now look at the right-hand side: the result is simply rows 1 and 2 of the weight matrix $w_{ij}$. Isn't that exactly the so-called "lookup" (finding the vector corresponding to a character in a table)? In fact, that is precisely what it is! This is all the so-called Embedding layer is: a fully connected layer whose input is one-hot and whose number of hidden nodes equals the word-vector dimension. And the parameters of this fully connected layer are the "word vector table"!
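To see this concretely, here is a small NumPy check (the 6×3 weight matrix is random and purely illustrative): multiplying the 2×6 one-hot matrix by $W$ gives exactly rows 1 and 2 of $W$.

```python
import numpy as np

one_hot = np.array([[1, 0, 0, 0, 0, 0],
                    [0, 1, 0, 0, 0, 0]])   # the 2x6 one-hot input for "科学"
W = np.random.randn(6, 3)                  # the fully connected layer's 6x3 weight matrix

# Matrix multiplication by the one-hot input...
dense_output = one_hot @ W

# ...is exactly the same as looking up rows 0 and 1 of W
lookup_output = W[[0, 1]]

print(np.allclose(dense_output, lookup_output))  # True
```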

Seen this way, character vectors haven't actually "done" anything new: it is still one-hot underneath. So stop mocking one-hot's problems; word vectors are simply the parameters of the fully connected layer that sits on top of the one-hot input!

So is there no innovation at all in character/word vectors? There is, at the operational level: people noticed that multiplying by a one-hot matrix is equivalent to a table lookup, so the operation is implemented directly as a lookup instead of being written out as a matrix multiplication, which greatly reduces the computational cost.

To emphasize again: the computational load was reduced not because of the emergence of word vectors, but because we simplified the one-hot matrix operation into a table lookup operation.
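This lookup is what an Embedding layer performs: you feed it integer character IDs rather than one-hot vectors, and it indexes into its weight matrix. A minimal sketch (assuming TensorFlow 2's bundled Keras; the sizes and IDs are illustrative):

```python
import numpy as np
from tensorflow.keras.layers import Embedding

# 6 characters in the vocabulary, 3-dimensional character vectors
emb = Embedding(input_dim=6, output_dim=3)

# Feed integer IDs (0 = "科", 1 = "学"), not one-hot vectors
ids = np.array([[0, 1]])
vectors = emb(ids)            # shape (1, 2, 3): one 3-d vector per character

# The layer's weights are the 6x3 "character vector table" itself
table = emb.get_weights()[0]  # shape (6, 3)
print(np.allclose(vectors.numpy()[0], table[[0, 1]]))  # True
```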

That is the operational level. At the conceptual level, once the parameters of this fully connected layer have been obtained, they are used directly as features; in other words, the parameters of this fully connected layer serve as the representation of each character/word, which is what gives us character/word vectors. Finally, some interesting properties were discovered, for example that the cosine similarity between two vectors reflects, to some extent, the similarity between the corresponding words.

By the way, some criticize Word2Vec (CBOW) for being only a 3-layer model, not qualifying as "deep" learning. In fact, if you count the one-hot fully connected layer, it has 4 layers, making it effectively a small deep model.

Where Do They Come From?

Wait: if word vectors are treated as the parameters of a fully connected layer (dear reader, let me correct you: it's not "treated as," they *are*), you still haven't told me how to obtain those parameters! The answer is: I don't know how they come about either. Don't the parameters of a neural network depend on your task? You should be asking yourself, not me. You say Word2Vec is unsupervised? Then let me clarify that point as well.

Strictly speaking, neural networks are all supervised. Models like Word2Vec are more accurately called "self-supervised." It actually trains a language model and obtains word vectors through the language model. A language model is simply a multi-class classifier that predicts the probability of the next character given the previous $n$ characters. We input one-hot, connect a fully connected layer, then several other layers, and finally a softmax classifier to get the language model. We then train it with massive amounts of text. Finally, the parameters obtained from the first fully connected layer become the character/word vector table. Of course, Word2Vec also made significant simplifications, but those simplifications were made to the language model itself; its first layer remains a fully connected layer, and those parameters are the word vector table.
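As a rough sketch of this recipe (this is not Word2Vec itself, which makes its own simplifications; just an illustrative Keras-style model, assuming TensorFlow 2, with made-up window size, dimensions, and vocabulary size): a window of previous character IDs passes through the embedding (i.e. the one-hot fully connected layer), a hidden layer, and a softmax over the vocabulary; after training, the first layer's weights are the character vector table.

```python
from tensorflow.keras import layers, models

vocab_size = 10000   # illustrative vocabulary size
embed_dim = 100      # character vector dimension
window = 4           # predict the next character from the previous 4 characters

# The first layer is the "one-hot fully connected layer", implemented as a lookup
embedding = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

model = models.Sequential([
    layers.Input(shape=(window,)),                   # a window of previous character IDs
    embedding,
    layers.Flatten(),
    layers.Dense(128, activation="relu"),            # "several other layers"
    layers.Dense(vocab_size, activation="softmax"),  # probability of the next character
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# After training on a large corpus, the character vector table is simply:
# char_vectors = embedding.get_weights()[0]   # shape (vocab_size, embed_dim)
```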

Looking at it this way, the issue is quite simple. I don't necessarily have to use a language model to train vectors, do I? Correct! You can use other tasks, such as a supervised text sentiment classification task. As stated, it is just a fully connected layer; what you connect after it is entirely up to you.
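For example, a supervised sentiment classifier can use exactly the same kind of embedding layer, just with a different head on top. Again only a hedged sketch with made-up sizes, assuming TensorFlow 2's Keras; whatever the embedding weights end up being after training are then your task-specific word vectors:

```python
from tensorflow.keras import layers, models

vocab_size = 10000
embed_dim = 100
max_len = 50   # illustrative: each text padded/truncated to 50 token IDs

# The same kind of embedding layer ("one-hot fully connected layer") as before
embedding = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    embedding,
    layers.GlobalAveragePooling1D(),           # average the word vectors of the text
    layers.Dense(1, activation="sigmoid"),     # positive vs. negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# embedding.get_weights()[0] is again a (vocab_size, embed_dim) word vector table,
# but now shaped by the sentiment task rather than by a language model.
```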

Of course, since labeled data is generally scarce, this is prone to overfitting. Therefore, character/word vectors are usually pre-trained unsupervised with large-scale corpora to reduce the risk of overfitting. Note: The reason for reduced overfitting risk is the availability of unlabeled corpora for pre-training (unlabeled corpora can be massive, and a large enough corpus eliminates overfitting risk). It has nothing to do with word vectors themselves. A word vector is just a layer of parameters to be trained; what inherent power does it have to reduce overfitting?

Finally, why do these character/word vectors have properties like cosine similarity or Euclidean distance reflecting, to some extent, the similarity between words? It is because, when training a language model unsupervised, we use a window: the next character is predicted from the previous $n$ characters (where $n$ is the window size), so characters that appear in the same windows receive similar updates. These updates accumulate, and characters with similar usage patterns accumulate a large number of similar updates. For example, the characters "忐" and "忑" (the two halves of the word for "anxious") are almost always used together: whenever "忐" is updated, "忑" is almost always updated too, with a nearly identical update, so the vectors for "忐" and "忑" inevitably end up almost identical. "Similar usage patterns" means that, for the specific linguistic task at hand, the words are interchangeable. For instance, in a general corpus, replacing "like" with "hate" in "I like you" still yields a perfectly natural sentence, so "like" and "hate" inevitably end up with similar word vectors. If, however, the word vectors were trained specifically for a sentiment classification task, "like" and "hate" would end up with very different vectors.
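For reference, the cosine similarity mentioned here is just the normalized dot product. A tiny pure-NumPy helper, with made-up vectors standing in for "忐" and "忑":

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(u, v) = u·v / (|u| |v|); close to 1 means the vectors point the same way
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy example: two nearly parallel vectors, as "忐" and "忑" would end up being
v_tan = np.array([0.80, -0.10, 0.50])
v_te  = np.array([0.79, -0.12, 0.52])
print(cosine_similarity(v_tan, v_te))  # very close to 1
```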

To Be Continued (Not)

I feel as though I haven't finished, but there doesn't seem to be much more to say. I hope this text helps everyone understand the concepts of character and word vectors and clarifies the essence of one-hot and Embedding. If you have questions or new insights, feel free to leave a comment.

Original article address: https://kexue.fm/archives/4122