By 苏剑林 | November 19, 2017
Finally, let's take a look at what useful properties the word vector model $(15)$ possesses, or rather, what do we gain by going to such pains to construct a new word vector model?
It seems that in almost all word vector models, very little attention is paid to the norm (length) of the word vectors. Interestingly, in the word vectors obtained from our aforementioned model, the norm can represent the importance of a word to a certain extent. We can understand this fact from two perspectives.
Within a single context window, the probability of the center word itself reappearing is actually not large; whether it reappears is a fairly random event, roughly independent of which word it is. Therefore, up to a constant factor, we can roughly assume: \[P(w,w) \sim P(w)\tag{24}\] So, according to our model, we have: \[e^{\langle\boldsymbol{v}_{w},\boldsymbol{v}_{w}\rangle} =\frac{P(w,w)}{P(w)P(w)}\sim \frac{1}{P(w)}\tag{25}\] Thus: \[\Vert\boldsymbol{v}_{w}\Vert^2 \sim -\log P(w)\tag{26}\] It can be seen that the more frequent a word is (and hence the more likely it is to be a stop word or function word), the smaller the norm of its word vector. This indicates that the norm of such word vectors can indeed represent the importance of a word. In fact, the quantity $-\log P(w)$ is similar to IDF and has its own name, ICF; please refer to the paper "TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams".
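As a quick empirical check, relation $(26)$ can be tested directly on any set of trained vectors. Below is a minimal sketch (not from the original post) that correlates squared norms with $-\log P(w)$; the dictionaries `word_vecs` (word to numpy array) and `word_counts` (word to corpus count) are hypothetical placeholders:

import numpy as np

def norm_vs_icf(word_vecs, word_counts):
    # Correlate squared vector norms with -log P(w); Eq. (26) predicts a positive correlation.
    total = float(sum(word_counts.values()))
    words = [w for w in word_vecs if w in word_counts]
    sq_norms = np.array([np.dot(word_vecs[w], word_vecs[w]) for w in words])
    icf = np.array([-np.log(word_counts[w] / total) for w in words])  # -log P(w)
    return np.corrcoef(sq_norms, icf)[0, 1]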
We can also understand this from another angle, by decomposing each vector into its norm and direction: \[\boldsymbol{v}=\Vert\boldsymbol{v}\Vert\cdot\frac{\boldsymbol{v}}{\Vert\boldsymbol{v}\Vert}\tag{27}\] Here the norm $\Vert\boldsymbol{v}\Vert$ is a single independent parameter, while the direction vector $\boldsymbol{v}/\Vert\boldsymbol{v}\Vert$ carries $n-1$ independent parameters, where $n$ is the dimension of the word vector. Since the numbers of parameters differ so much, when solving for word vectors, if a goal can be achieved by adjusting the norm alone, the model will naturally adjust the norm rather than struggle to adjust the direction. According to $(15)$, we have: \[\log\frac{P(w_i,w_j)}{P(w_i)P(w_j)}=\langle \boldsymbol{v}_i, \boldsymbol{v}_j\rangle=\Vert\boldsymbol{v}_i\Vert\cdot \Vert\boldsymbol{v}_j\Vert\cdot \cos\theta_{ij}\tag{28}\] For words like "的" (of) and "了" (aspect marker) that carry almost no meaning, in which direction will their word vectors point? As mentioned earlier, they appear very frequently but have almost no fixed collocations with anything; they combine more or less freely with whatever happens to be nearby. So we can assume that for any word $w_i$: \[\log\frac{P(w_i,\text{的})}{P(w_i)P(\text{的})}\approx 0\tag{29}\] The most convenient way to achieve this is simply to set $\Vert\boldsymbol{v}_{\text{的}}\Vert\approx 0$: a single parameter does the job, and the model is certainly willing to do so. That is to say, for words with high frequency but overall low mutual information with other words (such words usually have no special meaning), the norm will automatically approach 0. Therefore, we say the norm of a word vector can, to some extent, represent the importance of the word.
For a set of word vectors trained with the model in this article on the Baidu Baike corpus, without weight truncation, we sort the words by the norms of their vectors in ascending order. The first 50 results are:
\[\begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline \text{。} & \text{,} & \text{的} & \text{和} & \text{同样} & \text{也} & \text{1} & \text{3} & \text{并且} & \text{另外} \\ \hline \text{同时} & \text{是} & \text{2} & \text{6} & \text{总之} & \text{在} & \text{以及} & \text{5} & \text{因此} & \text{4} \\ \hline \text{7} & \text{8} & \text{等等} & \text{又} & \text{并} & \text{;} & \text{与此同时} & \text{然而} & \text{当中} & \text{事实上}\\ \hline \text{显然} & \text{这样} & \text{所以} & \text{例如} & \text{还} & \text{当然} & \text{就是} & \text{这些} & \text{而} & \text{因而} \\ \hline \text{此外} & \text{)} & \text{便是} & \text{即使} & \text{比如} & \text{因为} & \text{由此可见} & \text{一} & \text{有} & \text{即} \\ \hline \end{array}\]It is clear that these words are indeed what we call "stop words" or "function words," verifying that the norm represents the importance of the word itself. This result is somewhat related to whether the weight is truncated, because when truncating weights, the resulting order is:
\[\begin{array}{|c|c|c|c|c|c|c|c|c|c|} \hline \text{。} & \text{,} & \text{总之} & \text{同样} & \text{与此同时} & \text{除此之外} & \text{当中} & \text{便是} & \text{显然} & \text{无论是} \\ \hline \text{另外} & \text{不但} & \text{事实上} & \text{由此可见} & \text{即便} & \text{原本} & \text{先是} & \text{其次} & \text{后者} & \text{本来} \\ \hline \text{原先} & \text{起初} & \text{为此} & \text{另一个} & \text{其二} & \text{值得一提} & \text{看出} & \text{最初} & \text{或是} & \text{基本上} \\ \hline \text{另} & \text{从前} & \text{做为} & \text{自从} & \text{称之为} & \text{诸如} & \text{现今} & \text{那时} & \text{却是} & \text{如果说} \\ \hline \text{由此} & \text{的确} & \text{另一方面} & \text{其后} & \text{之外} & \text{在内} & \text{当然} & \text{前者} & \text{之所以} & \text{此外} \\ \hline \end{array}\]The obvious difference between the two tables is that in the second table, although they are still mostly stop words, some more obvious stop words like "的" (of) and "是" (is) are not at the front. This is because their word frequencies are quite high, so the impact of truncation is greater, leading to the possibility of underfitting (simply put, more attention is paid to low-frequency words, while high-frequency words only need to be "reasonable"). Why do the period and comma still rank highly? Because the probability of appearing twice in a window for a period "。" is much smaller than for "的". Therefore, the usage of the period "。" more closely matches the hypothesis of our derivation above, whereas "的" may appear multiple times in a window, hence the mutual information between "的" and itself should be larger, and its norm will be larger accordingly.
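The two tables above can be reproduced with a few lines of code by simply sorting the vocabulary by vector norm. A minimal sketch, again assuming a hypothetical `word_vecs` dictionary of trained vectors:

import numpy as np

def smallest_norm_words(word_vecs, topn=50):
    # Rank words by vector norm, ascending; the head of the list should be stop words.
    norms = {w: np.linalg.norm(v) for w, v in word_vecs.items()}
    return sorted(norms, key=norms.get)[:topn]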
Since we claim that the word analogy property is the definition of this model, does the model actually perform well in word analogies? Let's look at some examples.
\[\begin{array}{c|c} \hline A + B - C & D \\ \hline \text{Airport + Train - Airplane} & \text{Railway Station, Direct, East Station, High-speed Rail Station, South Station, Passenger Station} \\ \hline \text{King + Woman - Man} & \text{II, I, Queen, Kingdom, III, IV} \\ \hline \text{Beijing + UK - China} & \text{London, Paris, Residence, Moved to, Edinburgh, Brussels} \\ \hline \text{London + USA - UK} & \text{New York, Los Angeles, London, Chicago, San Francisco, Atlanta} \\ \hline \text{Guangzhou + Zhejiang - Guangdong} & \text{Hangzhou, Ningbo, Jiaxing, Jinhua, Huzhou, Shanghai} \\ \hline \text{Guangzhou + Jiangsu - Guangdong} & \text{Changzhou, Wuxi, Suzhou, Nanjing, Zhenjiang, Yangzhou} \\ \hline \text{Middle School + University Student - University} & \text{Middle school student, Primary and secondary school students, Youth, Electronic design, Village official, Middle School No. 2} \\ \hline \text{RMB + USA - China} & \text{US Dollar, HKD, Equivalent to, USD, Depreciation, Ten thousand USD} \\ \hline \text{Terracotta Warriors + Dunhuang - Xi'an} & \text{Mogao Caves, Scrolls, Manuscripts, Library Cave, Exquisite, Thousand Buddha Caves} \\ \hline \end{array}\]I would also like to clarify one point: regarding word analogy experiments, some look very beautiful while others seem unreliable, but in fact, word vectors reflect the statistical regularities of the corpus, which are objective. Conversely, some relations defined by humans are actually not objective. For a word vector model, if words are close, it means they have similar context distributions, not that we humanly define them as similar. Therefore, whether the effect is good depends on whether the viewpoint "similar context distribution ↔ similar words" (which is related to the corpus) matches the human definition of similarity (which is unrelated to the corpus and is subjective human thought). When you find the experimental results are not good, you might want to think about this point.
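For concreteness, here is a minimal sketch of how such analogy queries can be computed: for "A + B - C", rank candidates by cosine similarity to $\boldsymbol{v}_A + \boldsymbol{v}_B - \boldsymbol{v}_C$, excluding the query words themselves. `word_vecs` is a hypothetical dictionary of trained vectors:

import numpy as np

def analogy(word_vecs, a, b, c, topn=6):
    # Rank the vocabulary by cosine similarity to v_a + v_b - v_c.
    target = word_vecs[a] + word_vecs[b] - word_vecs[c]
    target /= np.linalg.norm(target)
    scores = {}
    for w, v in word_vecs.items():
        if w in (a, b, c):
            continue
        scores[w] = np.dot(v, target) / np.linalg.norm(v)
    return sorted(scores, key=scores.get, reverse=True)[:topn]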
Note equation $(15)$, which states that the mutual information between two words is equal to the inner product of their word vectors. A larger mutual information indicates a greater chance of the two words appearing together, while a smaller mutual information indicates the words are almost never used together. Therefore, we can use inner product ranking to find related words for a given word. Of course, the inner product includes the norm, and we just said the norm represents the importance of the word. If we ignore the importance and purely consider semantic meaning, we can normalize the vector norms before calculating the inner product; this approach is more stable: \[\cos\theta_{ij}=\left\langle \frac{\boldsymbol{v}_i}{\|\boldsymbol{v}_i\|}, \frac{\boldsymbol{v}_j}{\|\boldsymbol{v}_j\|}\right\rangle\tag{30}\]
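A minimal sketch of the two rankings just described, raw inner product versus the normalized cosine of $(30)$, again assuming a hypothetical `word_vecs` dictionary:

import numpy as np

def related_words(word_vecs, query, topn=10, use_cosine=True):
    # Rank words by <v_w, v_query> (mixes in importance via the norms),
    # or by cosine similarity after normalizing both vectors (Eq. (30)).
    q = word_vecs[query]
    if use_cosine:
        q = q / np.linalg.norm(q)
    scores = {}
    for w, v in word_vecs.items():
        if w == query:
            continue
        scores[w] = np.dot(v / np.linalg.norm(v), q) if use_cosine else np.dot(v, q)
    return sorted(scores, key=scores.get, reverse=True)[:topn]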
From probability theory, we know that if the mutual information is 0, the joint probability of two words is exactly the probability of their random combination, indicating that the two words are independent. Corresponding to equation $(15)$, this means the inner product of their word vectors is 0, i.e. the vectors are orthogonal (perpendicular), and we usually say that perpendicular vectors are unrelated. So, rather neatly, statistical independence between two words corresponds exactly to geometric orthogonality of their vectors. This is one of the elegances of the model's form.
It should be noted that, as mentioned before, stop words tend to shrink their norms rather than adjust their directions, so their directions carry little meaning; we can regard the directions of stop words as essentially random. As a result, when we look for related words by cosine similarity, unexpected stop words may show up in the results.
Note that what we discussed above is ranking related words. Related words and similar words (synonyms) are not the same thing!! For example, "single" and "frozen" (as in "frozen like a dog") are both very related to "dog," but they are not synonyms; "science" and "Development Outlook" are also very related, but they are not synonyms either.
So how do we find synonyms? In fact, this question is putting the cart before the horse, because the definition of similarity is human-made. For example, "like" and "love" are similar; what about "like" and "hate"? In a general topic classification task, they should be similar, but in a sentiment classification task, they are opposites. Another example is "run" and "grab"; usually we don't think they are similar, but in part-of-speech classification, they are similar because they share the same word class.
Returning to the hypothesis behind our word vector model: it is the context distribution of a word that reveals its meaning, so two similar words should have similar context distributions. The "Airport - Airplane + Train = Railway Station" analogy we discussed earlier was based on the same principle, but there it required a strict one-to-one correspondence between contexts, whereas here we only need an approximate correspondence. The conditions are thus relaxed, and to accommodate different notions of similarity, the set of context words can also be chosen by ourselves. Specifically, for two given words $w_i, w_j$ and their corresponding word vectors $\boldsymbol{v}_i, \boldsymbol{v}_j$, to calculate their similarity we first write out their mutual information with $N$ pre-specified words, i.e., \[\langle\boldsymbol{v}_i,\boldsymbol{v}_1\rangle,\langle\boldsymbol{v}_i,\boldsymbol{v}_2\rangle,\dots,\langle\boldsymbol{v}_i,\boldsymbol{v}_N\rangle\tag{31}\] and \[\langle\boldsymbol{v}_j,\boldsymbol{v}_1\rangle,\langle\boldsymbol{v}_j,\boldsymbol{v}_2\rangle,\dots,\langle\boldsymbol{v}_j,\boldsymbol{v}_N\rangle\tag{32}\] where the $N$ words are specified in advance (how to choose them is discussed at the end of this section). If the two words are similar, their context distributions should also be similar, so the two sequences above should be roughly linearly correlated. Therefore, we might as well compute their Pearson product-moment correlation coefficient: \[\frac{\sum_{k=1}^N \Big(\langle\boldsymbol{v}_i,\boldsymbol{v}_k\rangle - \overline{\langle\boldsymbol{v}_i,\boldsymbol{v}_k\rangle}\Big)\Big(\langle\boldsymbol{v}_j,\boldsymbol{v}_k\rangle - \overline{\langle\boldsymbol{v}_j,\boldsymbol{v}_k\rangle}\Big)}{\sqrt{\sum_{k=1}^N \Big(\langle\boldsymbol{v}_i,\boldsymbol{v}_k\rangle - \overline{\langle\boldsymbol{v}_i,\boldsymbol{v}_k\rangle}\Big)^2}\sqrt{\sum_{k=1}^N \Big(\langle\boldsymbol{v}_j,\boldsymbol{v}_k\rangle - \overline{\langle\boldsymbol{v}_j,\boldsymbol{v}_k\rangle}\Big)^2}}\tag{33}\] Where $\overline{\langle\boldsymbol{v}_i,\boldsymbol{v}_k\rangle}$ is the mean of $\langle\boldsymbol{v}_i,\boldsymbol{v}_k\rangle$, i.e., \[\overline{\langle\boldsymbol{v}_i,\boldsymbol{v}_k\rangle}=\frac{1}{N}\sum_{k=1}^N \langle\boldsymbol{v}_i,\boldsymbol{v}_k\rangle=\left\langle\boldsymbol{v}_i,\frac{1}{N}\sum_{k=1}^N \boldsymbol{v}_k\right\rangle = \langle\boldsymbol{v}_i,\bar{\boldsymbol{v}}\rangle\tag{34}\] Thus, the correlation coefficient can be simplified to: \[\frac{\sum_{k=1}^N \langle\boldsymbol{v}_i,\boldsymbol{v}_k-\bar{\boldsymbol{v}}\rangle\langle\boldsymbol{v}_j,\boldsymbol{v}_k-\bar{\boldsymbol{v}}\rangle}{\sqrt{\sum_{k=1}^N \langle\boldsymbol{v}_i,\boldsymbol{v}_k-\bar{\boldsymbol{v}}\rangle^2}\sqrt{\sum_{k=1}^N \langle\boldsymbol{v}_j,\boldsymbol{v}_k-\bar{\boldsymbol{v}}\rangle^2}}\tag{35}\] Using matrix notation (treating the word vectors as row vectors), we have: \[\begin{aligned}&\sum_{k=1}^N \langle\boldsymbol{v}_i,\boldsymbol{v}_k-\bar{\boldsymbol{v}}\rangle\langle\boldsymbol{v}_j,\boldsymbol{v}_k-\bar{\boldsymbol{v}}\rangle\\ =&\sum_{k=1}^N \boldsymbol{v}_i (\boldsymbol{v}_k-\bar{\boldsymbol{v}})^{\top}(\boldsymbol{v}_k-\bar{\boldsymbol{v}})\boldsymbol{v}_j^{\top}\\ =&\boldsymbol{v}_i \left[\sum_{k=1}^N (\boldsymbol{v}_k-\bar{\boldsymbol{v}})^{\top}(\boldsymbol{v}_k-\bar{\boldsymbol{v}})\right]\boldsymbol{v}_j^{\top}\end{aligned}\tag{36}\] What operation is in the square brackets?
In fact, it is: \[\boldsymbol{V}^{\top}\boldsymbol{V},\,\boldsymbol{V}=\begin{pmatrix}\boldsymbol{v}_1-\bar{\boldsymbol{v}}\\ \boldsymbol{v}_2-\bar{\boldsymbol{v}}\\ \vdots \\ \boldsymbol{v}_N-\bar{\boldsymbol{v}}\end{pmatrix}\tag{37}\] That is, arranging the word vectors after subtracting the mean into a matrix $\boldsymbol{V}$, and then calculating $\boldsymbol{V}^{\top}\boldsymbol{V}$. This is an $n\times n$ real symmetric matrix where $n$ is the word vector dimension. It can be decomposed (e.g., Cholesky decomposition) as: \[\boldsymbol{V}^{\top}\boldsymbol{V}=\boldsymbol{U}\boldsymbol{U}^{\top}\tag{38}\] Where $\boldsymbol{U}$ is an $n\times n$ real matrix, so the correlation coefficient formula can be written as: \[\frac{\boldsymbol{v}_i \boldsymbol{U}\boldsymbol{U}^{\top}\boldsymbol{v}_j^{\top}}{\sqrt{\boldsymbol{v}_i \boldsymbol{U}\boldsymbol{U}^{\top}\boldsymbol{v}_i^{\top}}\sqrt{\boldsymbol{v}_j \boldsymbol{U}\boldsymbol{U}^{\top}\boldsymbol{v}_j^{\top}}}=\frac{\langle\boldsymbol{v}_i \boldsymbol{U},\boldsymbol{v}_j \boldsymbol{U}\rangle}{\Vert\boldsymbol{v}_i \boldsymbol{U}\Vert \times \Vert\boldsymbol{v}_j \boldsymbol{U}\Vert}\tag{39}\] We find that similarity is still measured using the cosine of the vectors, but only after being transformed by matrix $\boldsymbol{U}$.
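A minimal sketch of equations $(33)$ to $(39)$: build the matrix $\boldsymbol{V}$ from the $N$ pre-chosen context words, decompose $\boldsymbol{V}^{\top}\boldsymbol{V}$ with a Cholesky factorization, and measure similarity as the cosine of the transformed vectors. The inputs `word_vecs` and `context_words` are hypothetical, and $\boldsymbol{V}^{\top}\boldsymbol{V}$ is assumed to be positive definite:

import numpy as np

def correlation_similarity(word_vecs, context_words, wi, wj):
    V = np.array([word_vecs[w] for w in context_words])   # N x n matrix of context vectors
    V = V - V.mean(axis=0)                                 # subtract the mean vector, Eq. (37)
    M = V.T.dot(V)                                         # n x n real symmetric matrix
    U = np.linalg.cholesky(M)                              # M = U U^T, Eq. (38)
    ui, uj = word_vecs[wi].dot(U), word_vecs[wj].dot(U)    # transformed vectors
    return ui.dot(uj) / (np.linalg.norm(ui) * np.linalg.norm(uj))  # cosine, Eq. (39)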
Finally, how to choose these $N$ words? We can rank them by frequency in descending order and select the top $N$. If $N$ is chosen fairly large (e.g., $N=10000$), we get semantically related words in general scenarios, similar to the results in the previous section. If $N$ is chosen small, such as $N=500$, we get syntactically similar words; for example, at this time, "爬" (climb) becomes very close to "掏" (dig out), "捡" (pick up), and "摸" (touch).
As in the article "The Incredible Word2Vec (3): Extracting Keywords", so-called keywords are words that can summarize the meaning of a sentence; that is, by looking only at the keywords, one can roughly guess what the sentence is about. Suppose a sentence contains $k$ words $w_1, w_2, \dots, w_k$; then a keyword $w$ should be such that: \[P(w_1,w_2,\dots,w_k|w)\sim \frac{P(w_1,w_2,\dots,w_k;w)}{P(w_1,w_2,\dots,w_k)P(w)}\tag{40}\] is maximized. Simply put, the probability of recovering the sentence given the word should be as large as possible. Since the sentence is given in advance, $P(w_1, w_2, \dots, w_k)$ is a constant, so maximizing the left-hand side is equivalent to maximizing the right-hand side. Continuing with the naive assumption, according to equation $(6)$, we have: \[\frac{P(w_1,w_2,\dots,w_k;w)}{P(w_1,w_2,\dots,w_k)P(w)}=\frac{P(w_1,w)}{P(w_1)P(w)}\frac{P(w_2,w)}{P(w_2)P(w)}\dots \frac{P(w_k,w)}{P(w_k)P(w)}\tag{41}\] Substituting our word vector model, we get: \[e^{\langle\boldsymbol{v}_1,\boldsymbol{v}_w\rangle}e^{\langle\boldsymbol{v}_2,\boldsymbol{v}_w\rangle}\dots e^{\langle\boldsymbol{v}_k,\boldsymbol{v}_w\rangle}=e^{\left\langle\sum_i \boldsymbol{v}_i, \boldsymbol{v}_w\right\rangle}\tag{42}\] So ultimately it is equivalent to maximizing: \[\left\langle\sum_i \boldsymbol{v}_i, \boldsymbol{v}_w\right\rangle\tag{43}\] Now the problem becomes simple: for a given sentence, sum the word vectors of all its words to get a sentence vector, compute the inner product (or, after normalization, the cosine) between this sentence vector and each word vector in the sentence, and sort in descending order. This is simple and crude, and it also reduces the computational complexity from $\mathcal{O}(k^2)$ to $\mathcal{O}(k)$. How about the results? Here are some examples.
Sentence: The second central environmental protection inspection team held a mobilization meeting in Hangzhou to inspect the work of Zhejiang Province. Starting from August 11 to September 11, the central environmental protection inspection team officially settled in Zhejiang. This also indicates that all enterprises in Zhejiang will face environmental inspections from the central team in the next month, which means they face risks of restricted production, production suspension, and shutdown.
Keyword Ranking: Inspection team, restricted production, inspection team, mobilization meeting, shutdown, inspection, production suspension, indicate, environmental protection, about to
Sentence: The Environmental Protection Bureau of Yiwu City, Zhejiang Province stated that, because the cadmium content of alloy raw materials is high, in order to control cadmium pollution it has issued a notice ordering some electroplating enterprises in the city to suspend production for rectification, effective immediately. It is reported that production of Yiwu's low-temperature zinc alloy (zinc-cadmium alloy) has basically stopped. In addition, electroplating enterprises in Ouhai District, Wenzhou City that have not passed acceptance inspection also received notices to stop production unconditionally from August 18 and to resume only after passing acceptance. A new round of environmental protection inspection has begun among downstream zinc enterprises in Zhejiang.
Keyword Ranking: Zinc alloy, Environmental Protection Bureau, production suspension, Ouhai District, permitted, cadmium, electroplating, ordering, Yiwu City, raw materials
Sentence: Boeuf Bourguignon is a classic French dish known as "the most delicious beef humans can cook." This dish has a rich wine aroma, attractive color, and the production process is not too troublesome. What are the stories behind it? How do you make a delicious Boeuf Bourguignon?
Keyword Ranking: Stewed beef, wine aroma, famous dish, delicious, attractive, Burgundy, rich, dish, cooking, one
Sentence: Astronomy experts introduced that the meteor shower will reach its peak around 0:30 on the 18th this year, with a zenithal hourly rate of about 10. There is no lack of bright bolides, which can be observed by the naked eye in most parts of China and even the Northern Hemisphere. The best observation time this year is from the early morning of the 17th to the 19th. Fortunately, there will be no moonlight interference, which is beneficial for observation.
Keyword Ranking: Meteor shower, bolide, Northern Hemisphere, zenith, observation, zero hour, naked eye, at that time, particles, early morning
It can be found that even for long sentences, this scheme is quite reliable. It is worth noting that while simple and crude, this keyword extraction scheme is not applicable to every type of word vector. GloVe vectors won't work because the norms of stop words in GloVe are larger, so GloVe results are exactly the opposite: words with smaller inner products (or cosine) are more likely to be keywords.
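For reference, a minimal sketch of the keyword-extraction scheme of $(43)$: sum the word vectors of a tokenized sentence and rank its words by inner product with that sentence vector. `word_vecs` is a hypothetical dictionary of trained vectors; the tokens could come from, e.g., `jieba.lcut`:

import numpy as np

def keywords(word_vecs, tokens, topn=10):
    # Sentence vector = sum of word vectors; rank each word by <v_w, sentence vector>.
    tokens = [w for w in tokens if w in word_vecs]
    sent_vec = np.sum([word_vecs[w] for w in tokens], axis=0)
    scores = {w: np.dot(word_vecs[w], sent_vec) for w in set(tokens)}
    return sorted(scores, key=scores.get, reverse=True)[:topn]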
Let's look at another example concerning sentence similarity, which many readers care about. In fact, it is similar to keyword extraction.
When are two sentences similar or even semantically equivalent? Simply put, if after reading the first sentence I know what the second sentence says, and vice versa. In this case, the correlation between the two sentences must be high. Suppose sentence $S_1$ has $k$ words $w_1, w_2, \dots, w_k$ and sentence $S_2$ has $l$ words $w_{k+1}, w_{k+2}, \dots, w_{k+l}$. Using the naive assumption and according to equation $(6)$, we get: \[\frac{P(S_1,S_2)}{P(S_1)P(S_2)}=\prod_{i=1}^k\prod_{j=k+1}^{k+l} \frac{P(w_i,w_j)}{P(w_i)P(w_j)}\tag{44}\] Substituting our word vector model, we get: \[\begin{aligned}\frac{P(S_1,S_2)}{P(S_1)P(S_2)}=&\prod_{i=1}^k\prod_{j=k+1}^{k+l} \frac{P(w_i,w_j)}{P(w_i)P(w_j)}\\ =&e^{\sum_{i=1}^k\sum_{j=k+1}^{k+l}\langle\boldsymbol{v}_i,\boldsymbol{v}_j\rangle}\\ =&e^{\left\langle\sum_{i=1}^k\boldsymbol{v}_i,\sum_{j=k+1}^{k+l}\boldsymbol{v}_j\right\rangle} \end{aligned}\tag{45}\] So the final ranking is equivalent to: \[\left\langle\sum_{i=1}^k\boldsymbol{v}_i,\sum_{j=k+1}^{k+l}\boldsymbol{v}_j\right\rangle\tag{46}\] The final result is also simple: just add all the word vectors of the two sentences to obtain their respective sentence vectors, then calculate the inner product (similarly, using cosine for normalization is also an option) to get the correlation between the two sentences.
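A minimal sketch of $(46)$: sentence similarity as the inner product (or cosine) of summed word vectors, with the same hypothetical `word_vecs` dictionary; `tokens1` and `tokens2` are the tokenized sentences:

import numpy as np

def sentence_similarity(word_vecs, tokens1, tokens2, cosine=True):
    # Sum the word vectors of each sentence, then compare the two sentence vectors.
    s1 = np.sum([word_vecs[w] for w in tokens1 if w in word_vecs], axis=0)
    s2 = np.sum([word_vecs[w] for w in tokens2 if w in word_vecs], axis=0)
    score = np.dot(s1, s2)
    if cosine:
        score /= (np.linalg.norm(s1) * np.linalg.norm(s2))
    return score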
The previous two sections both implied that sentence vectors can be obtained directly by summing word vectors. So what is the quality of such sentence vectors?
We did a simple experiment: using sentence vectors obtained by summing word vectors (non-truncated version) plus a linear classifier (logistic regression), we achieved an accuracy of about 81% on a sentiment classification problem. If a hidden layer is added (structure: an input of dimension 128, which is the word vector dimension, since the sentence vector is a sum of word vectors and thus has the same dimension; a hidden layer of 64 units with ReLU activation; a single output unit for binary classification), we get about 88% accuracy. For comparison, an LSTM reaches about 90%, which shows that these sentence vectors are quite respectable. Note that this set of word vectors was trained on Baidu Baike, so they do not inherently encode sentiment orientation, yet a simple sum of them still manages to capture the sentiment tendencies of words.
At the same time, to verify the impact of truncation on vector quality, we repeated the experiment with the truncated version of the word vectors. The result was a maximum accuracy of 82% for logistic regression and 89% for the three-layer neural network. This shows that truncation (which significantly downweights high-frequency words) indeed captures semantics better.
import numpy as np
import pandas as pd
import jieba

# Load the positive and negative review corpora (one review per row, no header)
pos = pd.read_excel('pos.xls', header=None)
neg = pd.read_excel('neg.xls', header=None)

# Tokenize each review with jieba
pos[1] = pos[0].apply(lambda s: jieba.lcut(s, HMM=False))
neg[1] = neg[0].apply(lambda s: jieba.lcut(s, HMM=False))

# Turn each tokenized review into a sentence vector by summing word vectors;
# this w2v.sent2vec function refers to the next article
pos[2] = pos[1].apply(w2v.sent2vec)
neg[2] = neg[1].apply(w2v.sent2vec)

# Stack sentence vectors with their labels (1 = positive, 0 = negative) and shuffle
pos = np.hstack([np.array(list(pos[2])), np.array([[1] for i in pos[2]])])
neg = np.hstack([np.array(list(neg[2])), np.array([[0] for i in neg[2]])])
data = np.vstack([pos, neg])
np.random.shuffle(data)

# Sentence vector -> 64-unit ReLU hidden layer -> sigmoid output for binary classification
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(64, input_shape=(w2v.word_size,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# First 16000 samples for training, the rest for validation
batch_size = 128
model.fit(data[:16000, :w2v.word_size], data[:16000, [w2v.word_size]],
          batch_size=batch_size,
          epochs=100,
          validation_data=(data[16000:, :w2v.word_size], data[16000:, [w2v.word_size]]))
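For reference, here is a minimal sketch (not part of the original code) of the logistic-regression baseline mentioned above, using scikit-learn on the same sentence-vector features; it assumes the `data` array and `w2v.word_size` from the code above:

from sklearn.linear_model import LogisticRegression

# Same train/validation split as the neural network above
X_train, y_train = data[:16000, :w2v.word_size], data[:16000, w2v.word_size]
X_valid, y_valid = data[16000:, :w2v.word_size], data[16000:, w2v.word_size]

clf = LogisticRegression(max_iter=1000)   # plain linear classifier on the sentence vectors
clf.fit(X_train, y_train)
print('validation accuracy:', clf.score(X_valid, y_valid))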