【Unbelievable Word2Vec】 4. A Different Kind of "Similarity"

By 苏剑林 | May 1, 2017

Definition of Similarity

When we obtain word vectors using Word2Vec, we generally use cosine similarity to compare the degree of similarity between two words, defined as: $$\cos (\boldsymbol{x}, \boldsymbol{y}) = \frac{\boldsymbol{x}\cdot\boldsymbol{y}}{\|\boldsymbol{x}\|\times\|\boldsymbol{y}\|}$$ With this concept of similarity, we can compare the similarity between any two words or find the words most similar to a given word. In gensim's Word2Vec, this is implemented by the most_similar function.
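As a quick illustration, here is a minimal sketch of this computation with numpy (the vectors stand for any two word vectors taken from a trained model):

import numpy as np

def cosine(x, y):
    # cos(x, y) = x . y / (|x| * |y|)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# gensim's most_similar ranks the whole vocabulary by exactly this value,
# e.g. cosine(model[u'广州'], model[u'深圳']) for a loaded model `model`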

Wait a minute! We quickly provided the mathematical formula for calculating similarity, but we haven't even "defined" what similarity is! How can we arrive at a formula to evaluate similarity without even defining it?

It is important to note that this is not an issue that can be casually ignored. Often, we don't know exactly what we are doing before we start doing it. As mentioned in the previous article regarding keyword extraction, I believe many people have never considered what a keyword actually is—is it just a word that is "key"? If we think of keywords as words used to estimate what an article is roughly about, we get a natural definition for keywords: $$keywords = \mathop{\text{argmax}}_{w\in s}p(s|w)$$ From there, we can model it using various methods.

Returning to the theme of this article, how do we define similarity? The answer is: it depends on the similarity required by the specific scenario.

So, what kind of similarity does cosine similarity provide? In fact, Word2Vec essentially uses the average distribution of context to describe the current word (since Word2Vec does not consider word order). Since the cosine value has nothing to do with the length (norm) of the vector, it describes "relative consistency." Thus, high cosine similarity essentially means that these two words are often paired with the same set of words, or more crudely, that the two words are interchangeable within the same sentence. For example, the words most similar to "Guangzhou" are "Dongguan" and "Shenzhen." This is because, in many scenarios, directly replacing "Guangzhou" in a sentence with "Dongguan" or "Shenzhen" still results in a reasonable sentence (reasonable in terms of sentence structure, though not necessarily factual; for instance, "Guangzhou is the capital of Guangdong" becomes "Dongguan is the capital of Guangdong"—the sentence is structurally sound, but the fact is incorrect).
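The interactive sessions below assume that pandas and the pre-trained model (the word2vec_wx file used throughout this post) have already been loaded, roughly as follows:

import pandas as pd
import gensim

# load the pre-trained Word2Vec model used in all the examples below
model = gensim.models.word2vec.Word2Vec.load('word2vec_wx')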

>>> s = u'广州' (Guangzhou)
>>> pd.Series(model.most_similar(s))
0 (东莞 [Dongguan], 0.840889930725)
1 (深圳 [Shenzhen], 0.799216389656)
2 (佛山 [Foshan], 0.786817014217)
3 (惠州 [Huizhou], 0.779960155487)
4 (珠海 [Zhuhai], 0.735232532024)
5 (厦门 [Xiamen], 0.725090026855)
6 (武汉 [Wuhan], 0.724122405052)
7 (汕头 [Shantou], 0.719602525234)
8 (增城 [Zengcheng], 0.713532149792)
9 (上海 [Shanghai], 0.710560560226)

Relevance: Another Kind of Similarity

As previously mentioned, the definition of similarity depends on the scenario, and cosine similarity is only one option. Sometimes we may feel that "Dongguan" and "Guangzhou" are not related at all: for a long-time Guangzhou local, words like "Baiyun Mountain," "Baiyun Airport," and "Canton Tower" are the ones most "similar" to "Guangzhou." This scenario is also very common, for example in travel recommendation: when someone comes to Guangzhou as a tourist, we would naturally like the system, given the input "Guangzhou," to output Guangzhou-related words such as "Baiyun Mountain," "Baiyun Airport," and "Canton Tower," rather than "Dongguan" or "Shenzhen."

This "similarity" is more accurately described as "relevance." How should we describe it? The answer is Mutual Information, defined as: $$\log \frac{p(x,y)}{p(x)p(y)}=\log p(y|x) - \log p(y)$$ The larger the mutual information, the more frequently the words $x$ and $y$ appear together.

In this way, given a word $x$, we can find the words that tend to appear alongside it. Word2Vec's Skip-Gram + Huffman Softmax model already provides everything we need: the model directly outputs $\log p(y|x)$, and $\log p(y)$ can be estimated from word frequencies. The code is as follows:

import numpy as np
import gensim
model = gensim.models.word2vec.Word2Vec.load('word2vec_wx')

def predict_proba(oword, iword):
    # log p(oword | iword) under Skip-Gram + Huffman (hierarchical) softmax
    iword_vec = model[iword]  # vector of the input (center) word
    oword = model.wv.vocab[oword]  # vocab entry of the output word: Huffman path + code
    oword_l = model.syn1[oword.point].T  # parameter vectors of the inner nodes on the path
    dot = np.dot(iword_vec, oword_l)
    # each node contributes log sigmoid(+-dot) according to its code bit
    lprob = -sum(np.logaddexp(0, -dot) + oword.code*dot)
    return lprob

from collections import Counter
def relative_words(word):
    # score each vocabulary word i by log p(i | word) - log p(i); p(i) is proportional
    # to the word count, so dropping the constant corpus size does not change the ranking
    r = {i:predict_proba(i, word)-np.log(j.count) for i,j in model.wv.vocab.iteritems()}
    return Counter(r).most_common()  # sorted by score, largest first
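For reference, here is in outline why the value returned by predict_proba is the hierarchical-softmax log-probability. Each inner node $j$ on the Huffman path of the output word has a parameter vector $\boldsymbol{\theta}_j$ (a row of syn1) and a code bit $c_j\in\{0,1\}$; writing $z_j$ for the dot product of the input word vector with $\boldsymbol{\theta}_j$ and $\sigma$ for the sigmoid function, the node contributes $\sigma(z_j)$ when $c_j=0$ and $1-\sigma(z_j)=\sigma(-z_j)$ when $c_j=1$, so $$\log p(y|x)=\sum_j\Big[\log\sigma(z_j)-c_j z_j\Big]=-\sum_j\Big[\log\big(1+e^{-z_j}\big)+c_j z_j\Big],$$ which is exactly the expression -sum(np.logaddexp(0, -dot) + oword.code*dot) in the code.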

At this point, the relevant words for "Guangzhou" are:

>>> s = u'广州' (Guangzhou)
>>> w = relative_words(s)
>>> pd.Series(w)
0 (福中路 [Fuzhong Road], -17.390365773)
1 (OHG, -17.4582544641)
2 (林寨镇 [Linzhai Town], -17.6119545612)
3 (坪山街道 [Pingshan Subdistrict], -17.6462214199)
4 (东圃镇 [Dongpu Town], -17.6648893759)
5 (西翼 [West Wing], -17.6796614955)
6 (北京西 [Beijing West], -17.6898282385)
7 (⇋, -17.6950761384)
8 (K1019, -17.7259853233)
9 (景泰街道 [Jingtai Subdistrict], -17.7292421556)
10 (PSW3, -17.7296432222)
11 (广州铁路职业技术学院 [Guangzhou Railway Polytechnic], -17.732288911)
12 (13A06, -17.7382891287)
13 (5872, -17.7404719442)
14 (13816217517, -17.7650583156)
15 (未遂案 [Attempted Case], -17.7713452536)
16 (增城市 [Zengcheng City], -17.7713832873)
17 (第十甫路 [Dishifu Road], -17.7727940473)
18 (广州白云机场 [Guangzhou Baiyun Airport], -17.7897457043)
19 (Faust, -17.7956389314)
20 (国家档案馆 [National Archives], -17.7971039916)
21 (w0766fc, -17.8051687721)
22 (K1020, -17.8106548248)
23 (陈宝琛 [Chen Baochen], -17.8427718407)
24 (jinriGD, -17.8647825023)
25 (3602114109100031646, -17.8729896156)

As we can see, the results are basically all closely related to Guangzhou. Of course, sometimes we want to give high-frequency words slightly more weight; in that case we can modify the mutual information formula to: $$\log \frac{p(x,y)}{p(x)p^{\alpha}(y)}=\log p(y|x) - \alpha\log p(y)$$ where $\alpha$ is a constant slightly less than 1. Taking $\alpha=0.9$:

from collections import Counter
def relative_words(word):
    # same scoring as before, but the log p(i) penalty is scaled by alpha = 0.9
    r = {i:predict_proba(i, word)-0.9*np.log(j.count) for i,j in model.wv.vocab.iteritems()}
    return Counter(r).most_common()

The results rearranged are as follows:

>>> s = u'广州' (Guangzhou)
>>> w = relative_words(s)
>>> pd.Series(w)
0 (福中路 [Fuzhong Road], -16.8342976099)
1 (北京西 [Beijing West], -16.9316053191)
2 (OHG, -16.9532688634)
3 (西翼 [West Wing], -17.0521852934)
4 (增城市 [Zengcheng City], -17.0523156839)
5 (广州白云机场 [Guangzhou Baiyun Airport], -17.0557270208)
6 (林寨镇 [Linzhai Town], -17.0867272184)
7 (⇋, -17.1061883426)
8 (坪山街道 [Pingshan Subdistrict], -17.1485480457)
9 (5872, -17.1627067119)
10 (东圃镇 [Dongpu Town], -17.192150594)
11 (PSW3, -17.2013228493)
12 (Faust, -17.2178736991)
13 (红粉 [Hongfen], -17.2191157626)
14 (国家档案馆 [National Archives], -17.2218467278)
15 (未遂案 [Attempted Case], -17.2220391092)
16 (景泰街道 [Jingtai Subdistrict], -17.2336594498)
17 (光孝寺 [Guangxiao Temple], -17.2781121397)
18 (国际货运代理 [International Freight Forwarding], -17.2810157155)
19 (第十甫路 [Dishifu Road], -17.2837591345)
20 (广州铁路职业技术学院 [Guangzhou Railway Polytechnic], -17.2953441257)
21 (芳村 [Fangcun], -17.301106775)
22 (检测院 [Testing Institute], -17.3041253252)
23 (K1019, -17.3085465963)
24 (陈宝琛 [Chen Baochen], -17.3134413583)
25 (林和西 [Linhexi], -17.3150577006)

Relatively speaking, the latter result is more readable. Other results are shown below:

>>> s = u'飞机' (Airplane)
>>> w = relative_words(s)
>>> pd.Series(w)
0 (澳门国际机场 [Macau International Airport], -16.5502216186)
1 (HawkT1, -16.6055740672)
2 (架飞机 [measure word + airplane], -16.6105400944)
3 (地勤人员 [Ground crew], -16.6764712234)
4 (美陆军 [US Army], -16.6781627384)
5 (SU200, -16.6842796275)
6 (起降 [Takeoff and landing], -16.6910345896)
7 (上海浦东国际机场 [Shanghai Pudong International Airport], -16.7040362134)
8 (备降 [Diverting/Alternate landing], -16.7232609719)
9 (第一架 [The first plane], -16.7304077856)

>>> pd.Series(model.most_similar(s))
0 (起飞 [Take off], 0.771412968636)
1 (客机 [Passenger plane], 0.758365988731)
2 (直升机 [Helicopter], 0.755871891975)
3 (一架 [One plane], 0.749522089958)
4 (起降 [Takeoff and landing], 0.726713418961)
5 (降落 [Landing], 0.723304390907)
6 (架飞机 [Airplanes], 0.722024559975)
7 (飞行 [Flight/Flying], 0.700125515461)
8 (波音 [Boeing], 0.697083711624)
9 (喷气式飞机 [Jet plane], 0.696866035461)

>>> s = u'自行车' (Bicycle)
>>> w = relative_words(s)
>>> pd.Series(w)
0 (骑 [Ride], -16.4410312554)
1 (放风筝 [Fly a kite], -16.6607225423)
2 (助力车 [Moped], -16.8390451582)
3 (自行车 [Bicycle], -16.900188791)
4 (三轮车 [Tricycle], -17.1053629907)
5 (租赁点 [Rental point], -17.1599389605)
6 (电动车 [Electric vehicle], -17.2038996636)
7 (助动车 [Power-assisted vehicle], -17.2523149342)
8 (多辆 [Multiple vehicles], -17.2629832083)
9 (CRV, -17.2856425014)

>>> pd.Series(model.most_similar(s))
0 (摩托车 [Motorcycle], 0.737690329552)
1 (骑 [Ride], 0.721182465553)
2 (滑板车 [Scooter], 0.7102201581)
3 (电动车 [Electric bike], 0.700758457184)
4 (山地车 [Mountain bike], 0.687280654907)
5 (骑行 [Cycling], 0.666575074196)
6 (单车 [Bike], 0.651858925819)
7 (骑单车 [Cycling], 0.650207400322)
8 (助力车 [Moped], 0.635745406151)
9 (三轮车 [Tricycle], 0.630989730358)

You can try it yourself. One caveat: although Huffman Softmax speeds up the training phase, at prediction time we have to traverse the entire vocabulary, and it is then actually slower than an ordinary Softmax. So this is not a particularly efficient solution.

What Have We Actually Done?

From the previous two parts, we can see that "similarity" generally comes in two forms: 1. often being paired with the same set of words (interchangeability); 2. often appearing together (relevance). Both can be regarded as similarity between words, each suited to different needs.

For example, consider word sense disambiguation for a polysemous word such as "star": does it refer to a celestial body or to a celebrity? Mutual information can help. We can take a corpus in which "star" means "celestial body" and collect the words with high mutual information with "star," such as "sun," "planet," or "earth"; likewise, from a corpus in which "star" means "celebrity," we collect words like "entertainment" or "movie." Given a new occurrence, we can then infer the intended sense from which set the surrounding words match better, as in the sketch below.
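A minimal sketch of this idea (the helper and the word lists below are hypothetical, not part of this post's code; in practice the indicator lists could be built with relative_words on the two sense-specific corpora):

def disambiguate(context_words, celestial_words, celebrity_words):
    # context_words: words surrounding the ambiguous occurrence of "star";
    # each indicator list holds words with high mutual information with one sense,
    # collected beforehand from a corpus where "star" is used in that sense
    celestial_score = len(set(context_words) & set(celestial_words))
    celebrity_score = len(set(context_words) & set(celebrity_words))
    return 'celestial body' if celestial_score >= celebrity_score else 'celebrity'

# hypothetical usage
print(disambiguate(['sun', 'planet'],
                   celestial_words=['sun', 'planet', 'earth'],
                   celebrity_words=['entertainment', 'movie']))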

In short, you must define your needs clearly and then consider the corresponding method.