By 苏剑林 | April 02, 2017
For readers familiar with Deep Learning and Natural Language Processing (NLP), Word2Vec is a household name. Although not everyone has used it personally, most people have heard of it—Google's highly efficient tool for obtaining word vectors.
Most people treat Word2Vec as if it were synonymous with word vectors. In other words, they use it purely as a tool for obtaining word representations, and few readers pay much attention to the model itself. Perhaps this is because the model is so heavily simplified that people assume something this simple cannot possibly model language accurately, even if its byproduct—the word vectors—is quite good. Indeed, viewed as a language model, Word2Vec is far too crude.
However, why should we view it only as a language model? Setting aside the constraints of language modeling and looking at the model itself, we find that the two Word2Vec models—CBOW and Skip-Gram—are actually extremely useful. They describe the relationship between surrounding words and the current word from different perspectives. Many basic NLP tasks, such as keyword extraction and logical reasoning, are built upon this relationship. This series of articles hopes to provide some inspiration by introducing the Word2Vec model itself and several "incredible" use cases, offering new ways to research such problems.
Speaking of Word2Vec being "incredible," when it was first released, perhaps the most surprising feature was its Word Analogy property—linear characteristics such as king - man ≈ queen - woman. The author, Mikolov, believed this property implied that the word vectors generated by Word2Vec possessed semantic reasoning capabilities. It was this feature, combined with the Google pedigree, that made Word2Vec rapidly popular. Unfortunately, when training word vectors ourselves, it is often difficult to reproduce this exact result, and there isn't even a solid theoretical basis suggesting that a good set of word vectors must satisfy this Word Analogy property. In contrast, the various uses I will introduce here are highly reproducible; readers can even train a Word2Vec model on a small corpus and obtain similar results.
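If you would like to see what this property looks like in practice, here is a minimal sketch. It assumes you are willing to use gensim and to download its copy of the publicly released Google News Word2Vec vectors (a large download, over a gigabyte); this is just one convenient way to get a pre-trained model.

```python
# Illustration only: query the classic "king - man + woman ≈ queen" analogy
# against the publicly released Google News Word2Vec vectors via gensim-data.
import gensim.downloader as api

# Pre-trained 300-dimensional vectors (large download on first use)
wv = api.load("word2vec-google-news-300")

# most_similar combines king + woman - man and returns the nearest
# vocabulary words (excluding the query words themselves)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```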
Readers interested in this series should understand the mathematical principles of Word2Vec. Since Word2Vec has been out for several years, there are countless articles introducing it. Personally, I recommend the series of blog posts by the expert peghoty: http://blog.csdn.net/itplus/article/details/37969519
Additionally, the post "What's Really Going On with Word Vectors and Embeddings?" on this blog is also helpful for understanding the principles of Word2Vec.
For the convenience of readers, I have collected two corresponding PDF files:
Mathematical Principles of word2vector in Detail.pdf
Deep Learning in Action: word2vec.pdf
The first one is the PDF version of the recommended blog series by peghoty. Of course, if your English is good, you can read the original Word2Vec papers directly:
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
However, I personally feel that the original papers are not as clearly explained as the Chinese commentaries.
Simply put, Word2Vec consists of "two training schemes + two acceleration methods," so strictly speaking it offers four candidate models, one for each pairing of a training scheme with an acceleration method.
The two training schemes are CBOW and Skip-Gram, as shown in the figure below:
(Figure: the two models of Word2Vec, CBOW and Skip-Gram)
In colloquial terms, they are "summing up the surrounding words to predict the current word" ($P(w_t|Context)$) and "using the current word to predict each of its surrounding words" ($P(w_{others}|w_t)$). Both are essentially conditional probability modeling problems.

The two acceleration methods are Hierarchical Softmax and Negative Sampling. Hierarchical Softmax is a simplification of the Softmax function that directly reduces the cost of computing the predicted probabilities from $\mathcal{O}(|V|)$ to $\mathcal{O}(\log_2 |V|)$, at a slight cost in precision compared with the original Softmax. Negative Sampling takes the opposite route: it feeds what were originally the input and the output into the model together, as a single input, and performs a binary classification that assigns the pair a score; this can be seen as modeling the joint probabilities $P(w_t, Context)$ and $P(w_t, w_{others})$. Positive samples are pairs that actually appear in the corpus, while negative samples are drawn at random. For more details, it is best to study peghoty's blog series closely; that is also where I learned the implementation details of Word2Vec.
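For concreteness, the Hierarchical Softmax form of Skip-Gram (following the notation of paper [2], with $w_t$ as the input word) replaces the flat Softmax over the whole vocabulary with a product of sigmoids along the Huffman-tree path to the output word $w$:

$$P(w|w_t)=\prod_{j=1}^{L(w)-1}\sigma\Big([\![n(w,j+1)=\mathrm{ch}(n(w,j))]\!]\cdot {v'_{n(w,j)}}^{\top} v_{w_t}\Big)$$

Here $n(w,j)$ is the $j$-th node on the path from the root to $w$, $L(w)$ is the length of that path, $\mathrm{ch}(n)$ is a fixed child of node $n$, $[\![x]\!]$ equals $1$ if $x$ is true and $-1$ otherwise, and $\sigma$ is the sigmoid function. Since the Huffman tree over the vocabulary has depth on the order of $\log_2 |V|$, evaluating this product costs $\mathcal{O}(\log_2 |V|)$ rather than $\mathcal{O}(|V|)$, which is exactly where the speedup comes from.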
Finally, it should be noted that the model used in this series is the combination of "Skip-Gram + Hierarchical Softmax." That is, we will be using the $P(w_{others}|w_t)$ model itself, not just the word vectors. Therefore, readers who wish to follow this series will need some understanding of the Skip-Gram model and some familiarity with the construction and implementation of Hierarchical Softmax.
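For readers who want something concrete to start from, here is a minimal training sketch. It assumes the gensim implementation of Word2Vec, which is only one convenient choice; the corpus and parameters below are placeholders.

```python
# A minimal sketch: training "Skip-Gram + Hierarchical Softmax" with gensim.
from gensim.models import Word2Vec

# Toy corpus: in practice this would be a large tokenized corpus.
sentences = [
    ["we", "use", "skip", "gram", "with", "hierarchical", "softmax"],
    ["word2vec", "learns", "word", "vectors", "from", "raw", "text"],
    ["the", "current", "word", "predicts", "its", "surrounding", "words"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=1,      # keep every word in this toy corpus
    sg=1,             # 1 = Skip-Gram (0 = CBOW)
    hs=1,             # use Hierarchical Softmax ...
    negative=0,       # ... and switch off Negative Sampling
)

# The word vectors are the familiar "byproduct":
print(model.wv["word2vec"])

# With hs=1, the model itself can score sentences, i.e. we can use
# P(w_others | w_t) directly rather than only the word vectors.
print(model.score([["word2vec", "learns", "word", "vectors"]]))
```

Note that gensim only implements `score()` for the hierarchical-softmax scheme, so `hs=1, negative=0` is needed for the last line; this is convenient here precisely because we want the $P(w_{others}|w_t)$ model itself and not just the word vectors.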
Please stay tuned~