By 苏剑林 | April 07, 2017
This article primarily provides a new definition for keywords and presents an implementation scheme based on Word2Vec. This definition of keywords is natural and reasonable; Word2Vec is merely a simplified implementation scheme. One could substitute it with other models based on the same definition.
When it comes to extracting keywords, one typically thinks of TF-IDF and TextRank. Have you ever considered that Word2Vec can also be used for keyword extraction? Moreover, extracting keywords with Word2Vec already inherently includes a degree of semantic understanding, rather than just simple statistics—and it is unsupervised!
Admittedly, TF-IDF and TextRank are two classic algorithms for keyword extraction, and both have a certain degree of rationality. However, for readers who have never seen these two algorithms, they represent a leap of logic; it is unlikely such readers could construct them from scratch. That is to say, although the two algorithms look simple, they are not easy to conceive of intuitively. For instance, a student who has not studied information theory might find it hard to understand why IDF takes a logarithm rather than some other function; and how many readers would spontaneously think of using PageRank-style logic to determine word importance?
Ultimately, the issue lies in this: extracting keywords and text summaries appear to be very natural tasks, but has anyone truly thought about what the definition of a keyword is? I am not asking for a dictionary definition, but for a mathematical definition. What should the mathematically sound definition of a keyword be? Or, what is our purpose in obtaining keywords?
Clearly, whether it is keywords or a summary, we hope to grasp the general meaning of an article as quickly as possible. If the keyword of an article is "Deep Learning," we know that the article is unlikely to be discussing "How the Steel Was Tempered" at length. In other words, we can guess the gist of the text from its keywords. Expressed mathematically, this is the conditional probability $$p(s|w_i)$$ where $s$ represents a segment of text and $w_i$ is a word within that text. If $w_i$ is a keyword of the text, it should maximize the above probability. That is to say, we only need to compute this probability for every word in the sentence and sort the words in descending order to extract the keywords. Simply put, keywords are the words that best allow us to guess the original text.

How do we estimate this probability? A simple way is to use the Naive Bayes assumption. If $s$ consists of $n$ words $w_1, w_2, \dots, w_n$, then $$p(s|w_i)=p(w_1,w_2,\dots,w_n|w_i)=\prod_{k=1}^n p(w_k|w_i)$$ Thus, we only need to estimate the transition probability $p(w_k|w_i)$ between words to obtain the conditional probability $p(s|w_i)$ and complete the keyword extraction.
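Since the logarithm is monotonic, ranking words by $p(s|w_i)$ is equivalent to ranking them by its logarithm, which turns the product into a sum: $$\log p(s|w_i)=\sum_{k=1}^n \log p(w_k|w_i)$$ This sum of log probabilities over the words of the sentence is exactly the score that the code later in this article accumulates for each candidate word.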
What is the connection to Word2Vec? To estimate $p(w_k|w_i)$, a large amount of text is needed for the statistics. Fortunately, this process is unsupervised, and the counting itself is quite simple. However, we have an even better tool. What tool is best at modeling $p(w_k|w_i)$? Readers may have already guessed: obviously, it is Word2Vec! Isn't Word2Vec's Skip-Gram model designed precisely to model this probability? Given Word2Vec's "fast, accurate, and powerful" reputation, there is no reason not to use it! (Of course, as mentioned at the beginning, one does not have to use the Naive Bayes assumption, and certainly does not have to use Word2Vec to compute it; what matters is that the definition of keywords itself is reasonable.)
At this point, readers should understand why I emphasized the Skip-Gram + Huffman Softmax combination so much in the previous two articles: because this combination is precisely for modeling $p(w_k|w_i)$. Of course, due to the nature of Huffman Softmax, calculating $p(w_k|w_i)$ requires a bit of effort. The reference code is as follows:
import numpy as np
import gensim

# load a pretrained Skip-Gram + Huffman Softmax model
model = gensim.models.word2vec.Word2Vec.load('word2vec_wx')

def predict_proba(oword, iword):
    # log p(oword | iword) under the Huffman (hierarchical) softmax
    iword_vec = model[iword]             # input word vector x
    oword = model.wv.vocab[oword]        # vocabulary entry of the output word
    oword_l = model.syn1[oword.point].T  # vectors of the nodes on its Huffman path
    dot = np.dot(iword_vec, oword_l)     # x·θ for every node on the path
    lprob = -sum(np.logaddexp(0, -dot) + oword.code*dot)  # sum of per-node log probabilities
    return lprob
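As a quick sanity check (assuming both words are in the vocabulary of the loaded word2vec_wx model), the function can be called on a single pair; it returns a log probability, so the value is negative, and values closer to zero mean the output word is easier to predict from the input word:

>>> predict_proba(u'恒星', u'太阳')   # log p(恒星 [star] | 太阳 [sun]), illustrative pair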
This is basically written by directly referencing the score_sg_pair function from gensim's Word2Vec. The simple flow is: take the Huffman encoding (path) of $w_k$ and the word vector of $w_i$, then, along the path, compute the probability of each node and multiply them together to obtain $p(w_k|w_i)$. Since we work with log probabilities, the multiplication becomes addition. How is the probability at each node calculated? In fact, according to the Word2Vec formula, the log probability of each node is: $$\log\Big[\sigma(\boldsymbol{x}\cdot\boldsymbol{\theta})^{1-d}\,\big(1-\sigma(\boldsymbol{x}\cdot\boldsymbol{\theta})\big)^{d}\Big]=\log \sigma\big((-1)^{d}\,\boldsymbol{x}\cdot\boldsymbol{\theta}\big)$$ where $\boldsymbol{\theta}$ is the node vector, $\boldsymbol{x}$ is the input word vector, $\sigma$ is the sigmoid function, and $d$ is the encoding of that node (either 0 or 1). However, the official score_sg_pair function is not written this way, because $$\log \sigma\big((-1)^{d}\,\boldsymbol{x}\cdot\boldsymbol{\theta}\big)=-\log\Big(1+e^{-(-1)^{d}\,\boldsymbol{x}\cdot\boldsymbol{\theta}}\Big)=-\Big[\log\big(1+e^{-\boldsymbol{x}\cdot\boldsymbol{\theta}}\big)+d\,\boldsymbol{x}\cdot\boldsymbol{\theta}\Big]$$ which covers both values of $d$ in a single expression: the right-hand side is exactly -(np.logaddexp(0, -dot) + oword.code*dot) in the code above, computed in a numerically stable, vectorized way.
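To double-check this identity numerically, here is a small self-contained NumPy sketch, independent of any trained model; z stands in for the dot product $\boldsymbol{x}\cdot\boldsymbol{\theta}$ and d for the node's Huffman code. It should print True for both values of d:

import numpy as np

z = np.array([-2.0, 0.5, 3.0])               # stand-ins for the dot products x·θ
for d in (0, 1):                             # d is the Huffman code of a node
    lhs = -np.logaddexp(0, -(-1)**d * z)     # log sigmoid((-1)^d · x·θ)
    rhs = -(np.logaddexp(0, -z) + d * z)     # the form used by score_sg_pair
    print(np.allclose(lhs, rhs))             # True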
With the groundwork laid above, calculating keywords is now simple:
from collections import Counter

def keywords(s):
    # keep only the words that are in the Word2Vec vocabulary
    s = [w for w in s if w in model]
    # score each word w by sum_k log p(w_k | w), i.e. log p(s | w) under the Naive Bayes assumption
    ws = {w: sum([predict_proba(u, w) for u in s]) for w in s}
    return Counter(ws).most_common()
import pandas as pd # introduced mainly for better display effects
import jieba
s = u'太阳是一颗恒星' # The sun is a star
pd.Series(keywords(jieba.cut(s)))
The output result is:
0 (恒星 [Star], -27.9013707845)
1 (太阳 [Sun], -28.1072913493)
2 (一颗 [A], -30.482187911)
3 (是 [is], -36.3372344659)
Other examples:
>>> s=u'昌平区政府网站显示,明十三陵是世界上保存完整、埋葬皇帝最多的墓葬群,1961年被国务院公布为第一批全国重点文物保护单位,并于2003年被列为世界遗产名录。' (The Changping District Government website shows that the Ming Tombs are the world's best-preserved imperial tomb cluster with the most buried emperors. In 1961, they were announced by the State Council as one of the first batch of national key cultural relics protection units, and in 2003, they were listed in the World Heritage List.)
>>> pd.Series(keywords(jieba.cut(s)))
0 (文物保护 [Cultural relics protection], -261.691625676)
1 (名录 [List], -272.297758506)
2 (世界遗产 [World Heritage], -273.943120665)
3 (第一批 [First batch], -280.781786703)
4 (列为 [Listed as], -281.663865896)
5 (明十三陵 [Ming Tombs], -286.298893108)
6 (墓葬群 [Tomb cluster], -287.463013816)
...
>>> s=u'雄安新区横空出世,吸引了众多外地炒房客前去购房。然而,在当地政府重拳遏制非法炒房、楼市冻结的背景下,那些怀揣买房钱却在雄安新区无处下手的投资需求,被挤出到周边地区。' (The Xiongan New Area emerged suddenly, attracting many out-of-town real estate speculators to buy houses. However, against the backdrop of the local government's heavy crackdown on illegal speculation and the freezing of the real estate market, investment demand from those clutching money but unable to buy in Xiongan New Area has been squeezed into surrounding areas.)
>>> pd.Series(keywords(jieba.cut(s)))
0 (炒房客 [Speculators], -326.997266407)
1 (楼市 [Real estate market], -336.176584187)
2 (炒房 [House speculation], -337.190896137)
3 (买房 [Buying houses], -344.613473556)
4 (购房 [Purchasing houses], -346.396359454)
5 (重拳 [Heavy fist/crackdown], -350.207272082)
6 (外地 [Out-of-town], -355.860419218)
>>> s=u'如果给一部古装电影设计服装,必须要考虑故事发生在哪个朝代,汉朝为宽袍大袖,清朝则是马褂旗袍。可在京剧舞台上,几乎任何一个历史人物,根据他的性别年龄、身份地位、基本性格等等,都可以在现有的服饰里找到合适的行头。 ' (If designing costumes for a period film, one must consider in which dynasty the story takes place: the Han Dynasty wore wide-sleeved robes, whereas the Qing Dynasty wore magua and cheongsam. But on the Peking Opera stage, nearly any historical figure—regardless of gender, age, status, or personality—can find suitable attire within the existing wardrobe.)
>>> pd.Series(keywords(jieba.cut(s)))
0 (朝代 [Dynasty], -485.150966757)
1 (人物 [Character/Figure], -493.759615898)
2 (古装 [Period costume], -495.478962392)
3 (汉朝 [Han Dynasty], -503.409908377)
4 (清朝 [Qing Dynasty], -503.45656029)
5 (旗袍 [Cheongsam], -504.76313228)
6 (身份 [Status/Identity], -507.624260109)
You can try it yourselves. If you want to try it on your own corpus, just train a Word2Vec (Skip-Gram + Huffman Softmax) model on the corpus and then call the code above.
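For reference, a minimal training sketch with gensim might look as follows. The file name corpus.txt is a placeholder and the parameter values are only illustrative; what matters is sg=1 (Skip-Gram) together with hs=1 and negative=0 (Huffman softmax), because predict_proba above relies on model.syn1 and the Huffman codes/paths. The parameter names follow the same older gensim API as the code above:

import codecs
import jieba
import gensim

# one (hypothetical) document per line in corpus.txt, segmented with jieba
sentences = [list(jieba.cut(line.strip()))
             for line in codecs.open('corpus.txt', encoding='utf-8')]

model = gensim.models.word2vec.Word2Vec(
    sentences,
    sg=1,         # Skip-Gram
    hs=1,         # hierarchical (Huffman) softmax
    negative=0,   # disable negative sampling; we need the Huffman tree
    size=256,     # embedding dimension, adjust to taste
    window=5,
    min_count=5,
)
model.save('word2vec_wx')   # same file name that is loaded at the top of this article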
According to our original idea, $p(w_k|w_i)$ should be estimated over the entire sentence, but Word2Vec only uses a local window. Is this reasonable? In fact, although Word2Vec only uses a window, it successfully establishes connections between similar words. That is to say, using Word2Vec in the above procedure essentially aggregates "similar words" for evaluation, whereas the TF-IDF approach only aggregates "identical words." This is why we say that keyword extraction with Word2Vec already brings in a degree of semantic judgment. Furthermore, by considering $p(w_k|w_i)$, Word2Vec accounts for associations between words within the article, which has a flavor of TextRank: it is a bigram-like model, whereas TF-IDF only considers the information content of each word on its own, making it a unigram model.
Moreover, Word2Vec is trained based on neural networks and comes with inherent smoothing capabilities; even if two words have never co-occurred in the text, a relatively reasonable probability can still be obtained.
The cost of doing this is, of course, efficiency: TF-IDF runs in $\mathcal{O}(N)$, whereas extraction with Word2Vec is clearly $\mathcal{O}(N^2)$, where $N$ is the number of words in the sentence, since each of the $N$ candidate words has to be scored against all $N$ words.