By 苏剑林 | April 3, 2017
Since the following posts will explain how to use Word2Vec, I have pre-trained a Word2Vec model with Gensim. To save readers' time and let everyone reproduce the results in later sections, I have decided to share it. The word vectors alone are not very large, but, as mentioned in the first post, we will need the complete Word2Vec model. I am therefore sharing the full model; it consists of four files, which makes the total size somewhat large.
A reminder to readers: if you want to obtain a complete Word2Vec model without modifying the source code, Python's Gensim library is essentially your only choice. As far as I know, other versions of Word2Vec only provide the final word vectors and do not include the complete model parameters.
For the purpose of knowledge mining, Word2Vec trained on knowledge base corpora (such as Encyclopedia/Wikipedia data) usually yields better results. I am still in the process of crawling encyclopedia data; once finished, I will train another model and share it then.
Model Overview
The general profile of this model is as follows:
\[
\begin{array}{c|c}
\hline
\text{Training Corpus} & \text{WeChat Official Account articles, multi-domain, balanced Chinese corpus} \\
\hline
\text{Corpus Volume} & \text{8 million articles, total word count reaching 65 billion} \\
\hline
\text{Vocabulary Size} & \text{352,196 words total, mostly Chinese, includes common English words} \\
\hline
\text{Model Architecture} & \text{Skip-Gram + Huffman Softmax} \\
\hline
\text{Vector Dimensions} & \text{256 dimensions} \\
\hline
\text{Tokenization Tool} & \text{Jieba, with a 500,000-entry dictionary, new word discovery disabled} \\
\hline
\text{Training Tool} & \text{Gensim Word2Vec, trained for 7 days on a server} \\
\hline
\text{Other Details} & \text{Window size of 10, minimum word count of 64, 10 iterations} \\
\hline
\end{array}
\]
It is important to note: WeChat articles are relatively "modern" and reflect recent internet hot spots with wide coverage, making the content quite representative. For tokenization, I used Jieba and disabled its "new word discovery" feature—I'd rather have fewer words segmented accurately than many segmented poorly. Of course, the default dictionary is insufficient; I compiled a 500,000-entry dictionary from two sources: 1. Merged dictionaries collected from the web; 2. New word discovery performed on WeChat articles, followed by manual filtering. Consequently, the segmentation results are quite reliable and include many popular slang terms, making it highly usable.
Training Code
You can refer to this for your own modifications. Note that `hashlib.md5` is used here for deduplication (originally 10 million articles were reduced to 8 million after deduplication); this step is not strictly necessary.
# (Original code block provided in the blog for reference)
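Since the original block is omitted here, below is a minimal sketch of the two steps the text describes: hashing each article with `hashlib.md5` to drop duplicates, then feeding the deduplicated, tokenized texts to Gensim. The corpus is illustrative, and the training call is shown in outline only, with parameter values taken from the table above (names per Gensim 4.x):

```python
import hashlib

def dedup(articles):
    """Keep the first copy of each article, keyed by its MD5 digest."""
    seen, unique = set(), []
    for text in articles:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

articles = ["深度 学习 入门", "深度 学习 入门", "词 向量 模型"]
unique = dedup(articles)  # duplicate article dropped

# Training on the deduplicated corpus (outline; real corpus is 8M articles):
# from gensim.models import Word2Vec
# sentences = [a.split() for a in unique]
# model = Word2Vec(sentences, vector_size=256, window=10, min_count=64,
#                  sg=1, hs=1, epochs=10)
```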
Download Links
Link: https://pan.baidu.com/s/1htC495U Password: 4ff8
Included files: word2vec_wx, word2vec_wx.syn1neg.npy, word2vec_wx.syn1.npy, word2vec_wx.wv.syn0.npy. All four files are needed for Gensim to load the model. In Gensim's layout, word2vec_wx is the pickled model object itself, word2vec_wx.wv.syn0.npy is the input word-vector matrix, word2vec_wx.syn1.npy holds the hidden-to-output weights for the hierarchical (Huffman) softmax, and word2vec_wx.syn1neg.npy holds the output weights used by negative sampling.
If you only care about the word vectors, you can also download the C-version format (compatible with the original C version of Word2Vec, containing only vectors):
Link: https://pan.baidu.com/s/1nv3ANLB Password: dgfw
Some Demonstrations
Below are a few quick demonstrations of nearest-neighbor (most-similar-word) queries against the model. Suggestions for improvement are welcome.
>>> import gensim
>>> import pandas as pd
>>> model = gensim.models.Word2Vec.load('word2vec_wx')
>>> pd.Series(model.most_similar(u'微信')) # WeChat
0 (QQ, 0.752506196499)
1 (订阅号, 0.714340209961) # Subscription Account
2 (QQ号, 0.695577561855) # QQ Number
3 (扫一扫, 0.695488214493) # Scan QR Code
4 (微信公众号, 0.694692015648) # WeChat Official Account
5 (私聊, 0.681655049324) # Private Chat
6 (微信公众平台, 0.674170553684) # WeChat Public Platform
7 (私信, 0.65382117033) # Direct Message
8 (微信平台, 0.65175652504) # WeChat Platform
9 (官方, 0.643620729446) # Official
>>> pd.Series(model.most_similar(u'公众号')) # Official Account
0 (订阅号, 0.782696723938)
1 (微信公众号, 0.760639667511)
2 (微信公众账号, 0.73489522934)
3 (公众平台, 0.716173946857)
4 (扫一扫, 0.697836577892)
5 (微信公众平台, 0.696847081184)
6 (置顶, 0.666775584221) # Pin to top
7 (公共账号, 0.665741920471)
8 (微信平台, 0.661035299301)
9 (菜单栏, 0.65234708786) # Menu Bar
>>> pd.Series(model.most_similar(u'牛逼')) # Awesome (slang)
0 (牛掰, 0.701575636864)
1 (厉害, 0.619165301323)
2 (靠谱, 0.588266670704)
3 (苦逼, 0.586573541164)
4 (吹牛逼, 0.569260418415)
5 (了不起, 0.565731525421)
6 (牛叉, 0.563843131065)
7 (绝逼, 0.549570798874)
8 (说真的, 0.549259066582)
9 (两把刷子, 0.545115828514)
>>> pd.Series(model.most_similar(u'广州')) # Guangzhou
0 (东莞, 0.840889930725)
1 (深圳, 0.799216389656)
2 (佛山, 0.786817133427)
3 (惠州, 0.779960036278)
4 (珠海, 0.73523247242)
5 (厦门, 0.72509008646)
6 (武汉, 0.724122405052)
7 (汕头, 0.719602584839)
8 (增城, 0.713532209396)
9 (上海, 0.710560560226)
>>> pd.Series(model.most_similar(u'朱元璋')) # Zhu Yuanzhang (Ming Founder)
0 (朱棣, 0.857951819897)
1 (燕王, 0.853199958801)
2 (朝廷, 0.847517609596)
3 (明太祖朱元璋, 0.837111353874)
4 (赵匡胤, 0.835654854774)
5 (称帝, 0.835589051247)
6 (起兵, 0.833530187607)
7 (明太祖, 0.829249799252)
8 (太祖, 0.826784193516)
9 (丞相, 0.826457977295)
>>> pd.Series(model.most_similar(u'微积分')) # Calculus
0 (线性代数, 0.808522999287) # Linear Algebra
1 (数学分析, 0.791161835194) # Mathematical Analysis
2 (高等数学, 0.786414265633) # Higher Mathematics
3 (数学, 0.758676528931) # Mathematics
4 (概率论, 0.747221827507) # Probability Theory
5 (高等代数, 0.737897276878) # Advanced Algebra
6 (解析几何, 0.730488717556) # Analytic Geometry
7 (复变函数, 0.715447306633) # Functions of Complex Variables
8 (微分方程, 0.71503329277) # Differential Equations
9 (微积分学, 0.704192101955) # Calculus (discipline)
>>> pd.Series(model.most_similar(u'apple'))
0 (banana, 0.79927945137)
1 (pineapple, 0.789698243141)
2 (pen, 0.779583632946)
3 (orange, 0.769554674625)
4 (sweet, 0.721074819565)
5 (fruit, 0.71402490139)
6 (pie, 0.711439430714)
7 (watermelon, 0.700904607773)
8 (apples, 0.697601020336)
9 (juice, 0.694036960602)
>>> pd.Series(model.most_similar(u'企鹅')) # Penguin
0 (海豹, 0.665253281593) # Seal
1 (帝企鹅, 0.645192623138) # Emperor Penguin
2 (北极熊, 0.619929730892) # Polar Bear
3 (大象, 0.618502140045) # Elephant
4 (鲸鱼, 0.606555819511) # Whale
5 (猫, 0.591019570827) # Cat
6 (蜥蜴, 0.584576964378) # Lizard
7 (蓝鲸, 0.572826981544) # Blue Whale
8 (海豚, 0.566122889519) # Dolphin
9 (猩猩, 0.563284397125) # Gorilla
>>> pd.Series(model.most_similar(u'足球')) # Football/Soccer
0 (篮球, 0.842746257782) # Basketball
1 (足球运动, 0.819511592388) # Football (the sport)
2 (青训, 0.793446540833) # Youth Training
3 (排球, 0.774085760117) # Volleyball
4 (乒乓球, 0.760577201843) # Table Tennis
5 (足球赛事, 0.758624792099) # Football Matches
6 (棒垒球, 0.750351667404) # Baseball/Softball
7 (篮球运动, 0.746055066586)
8 (足球队, 0.74296438694) # Football Team
9 (网球, 0.742858171463) # Tennis
>>> pd.Series(model.most_similar(u'爸爸')) # Dad
0 (妈妈, 0.779690504074) # Mom
1 (儿子, 0.752222895622) # Son
2 (奶奶, 0.70418381691) # Grandma
3 (妈, 0.693783283234) # Mother
4 (爷爷, 0.683066487312) # Grandpa
5 (父亲, 0.673043072224) # Father
6 (女儿, 0.670304119587) # Daughter
7 (爸妈, 0.669358253479) # Mom and Dad
8 (爸, 0.663688421249) # Pop
9 (外婆, 0.652905225754) # Maternal Grandmother
>>> pd.Series(model.most_similar(u'淘宝')) # Taobao
0 (淘, 0.770935535431)
1 (店铺, 0.739198565483) # Store
2 (手机端, 0.728774428368) # Mobile End
3 (天猫店, 0.725838780403) # Tmall Store
4 (口令, 0.721312999725) # Password/Token
5 (登录淘宝, 0.717839717865) # Login to Taobao
6 (淘宝店, 0.71473968029) # Taobao Store
7 (淘宝搜, 0.697688698769) # Search on Taobao
8 (天猫, 0.690212249756) # Tmall
9 (网店, 0.6820114851) # Online Store