Minimum Entropy Principle (II): Word Bank Construction via "Prompt Decisions"

By 苏剑林 | April 24, 2018

In this article, we introduce the first move of our "Routine Bible", named "Prompt Decisions". First, we derive the concept of average per-character information entropy and, from the minimum entropy principle, deduce a mutual-information criterion; second, we use it to complete the unsupervised construction of a word bank and give an information-entropy interpretation of the unigram segmentation model, thereby demonstrating the basic methods and techniques for generating and identifying "routines" (patterns).

This is both the first use case of the minimum entropy principle and the general outline for the entire "Routine Bible."

Whether you practice it or not, the routine is there, neither increasing nor decreasing.

Why do we need words?

As seen in the previous article, assuming we don't understand Chinese at all, we initially view it as a series of randomly combined strings of "characters." However, we gradually discover that the context is connected; it isn't a random combination of characters, but rather a random combination of "routines." To reduce our memory burden, we attempt to mine these linguistic "routines." The first such "routine" is the fixed combination of adjacent characters: what we understand as "words."

Average character information entropy

Suppose there is a corpus. If we segment it into words and treat the word as the unit of Chinese, the information content of each word is $-\log p_w$. Thus, we can calculate the time spent to memorize this corpus as:

\[-\sum_{w\in \text{corpus}}\log p_w\tag{2.1}\]

Here, $w\in \text{corpus}$ denotes summing over every word occurrence in the corpus without deduplication. If we do not segment words and instead understand it by characters, the time required is:

\[-\sum_{c\in \text{corpus}}\log p_c\tag{2.2}\]

Based on our previous understanding, word segmentation is intended to reduce memory burden, so theoretically, it should hold that:

\[-\sum_{w\in \text{corpus}}\log p_w < -\sum_{c\in \text{corpus}}\log p_c\tag{2.3}\]

Since many words appear repeatedly, we can merge identical terms:

\[-\sum_{w\in \text{vocabulary}}N_w\log p_w < -\sum_{c\in \text{character list}}N_c\log p_c\tag{2.4}\]

Where $N_w, N_c$ are the frequencies of words and characters obtained from the corpus. Dividing both sides of equation $(2.4)$ by the total number of characters in the corpus, the right side becomes:

\[-\sum_{c\in \text{character list}}\frac{N_c}{\text{total characters}}\log p_c = -\sum_{c\in \text{character list}}p_c \log p_c\tag{2.5}\]

Where $N_c/\text{total characters} = p_c$ is the frequency of character $c$. Thus, the above expression is the average information per character. The left side of $(2.4)$ can be transformed into:

\[\begin{aligned}\mathcal{L} =& - \sum_{w\in \text{vocabulary}}\frac{N_w}{\text{total characters}}\log p_w \\ =& \left(-\sum_{w\in \text{vocabulary}}\frac{N_w}{\text{total words}}\log p_w\right)\div \left(\frac{\text{total characters}}{\text{total words}}\right)\\ =&\left(-\sum_{w\in \text{vocabulary}}\frac{N_w}{\text{total words}}\log p_w\right)\div \left(\frac{\sum\limits_{w\in \text{vocabulary}}N_w l_w}{\text{total words}}\right)\\ =&\left(-\sum_{w\in \text{vocabulary}}\frac{N_w}{\text{total words}}\log p_w\right)\div \left(\sum\limits_{w\in \text{vocabulary}}\frac{N_w}{\text{total words}} l_w\right)\\ =&\frac{\mathcal{H}}{l}\end{aligned}\tag{2.6}\]

Where $N_w/\text{total words} = p_w$ is the frequency of word $w$, and $l_w$ is the number of characters in word $w$. Thus, $\mathcal{H}$ is the average information content per word, and $l$ is the average number of characters per word:

\[\mathcal{H} = -\sum_{w\in\text{vocabulary}} p_w\log p_w,\quad l=\sum_{w\in\text{vocabulary}} p_w l_w\tag{2.7}\]

Therefore, $\mathcal{L}$ is effectively the average information content per character when memorizing by words, and it is the ultimate goal we wish to compare and optimize. We convert total information quantity into average information per character to obtain a unified measurement standard.
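To make equations $(2.6)$ and $(2.7)$ concrete, here is a minimal Python sketch (with a toy, purely illustrative pre-segmented corpus) that computes $\mathcal{H}$, $l$, and $\mathcal{L}=\mathcal{H}/l$, and checks that dividing the total information in $(2.4)$ by the total number of characters gives the same value. It uses natural logarithms (nats); divide by $\ln 2$ to get bits.

```python
import math
from collections import Counter

# Toy pre-segmented corpus (illustrative only): each sentence is a list of words.
corpus = [["中外", "科学", "名著"], ["中国", "科学", "发展"], ["中外", "名著"]]

counts = Counter(w for sent in corpus for w in sent)   # N_w
total_words = sum(counts.values())
total_chars = sum(len(w) * n for w, n in counts.items())

# Equation (2.7): H = -sum p_w ln p_w,  l = sum p_w l_w
H = -sum(n / total_words * math.log(n / total_words) for n in counts.values())
l = sum(n / total_words * len(w) for w, n in counts.items())
L = H / l   # equation (2.6): average information per character when memorizing by words

# Cross-check: total information (2.4) divided by total characters gives the same L
L_direct = -sum(n * math.log(n / total_words) for n in counts.values()) / total_chars
assert math.isclose(L, L_direct)
print(f"H = {H:.3f} nats, l = {l:.3f} chars/word, L = {L:.3f} nats/char")
```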

Enrich your vocabulary

Does word segmentation truly reduce learning difficulty as expected? I performed a simple statistical analysis. After basic word segmentation on articles from WeChat public accounts, $\mathcal{H}\approx 10.8$ bits, and the average word length is 1.5 characters, meaning $l\approx 1.5$. Thus $\mathcal{L}=7.2$ bits, which is significantly smaller than the 9.65 bits required for character-by-character memorization. This demonstrates that in Chinese, memorizing by words is indeed easier than memorizing by characters.

The operation of "word segmentation" reduces the learning difficulty of Chinese, which is why "word segmentation" is typically the first step in Chinese NLP.

Simply put, after segmenting a language, the vocabulary size we need to memorize increases, but the length of each sentence decreases. Overall, the learning difficulty is reduced. Therefore, it is easy to understand why learning English requires memorizing words; enriching our vocabulary reduces the difficulty of learning the language.

How routines are forged

Conversely, we can use the minimization of $\mathcal{L}$ to guide the unsupervised construction of the word bank. This is the core content of the lesson "How Routines Are Forged."

Routines start from the local

First, we localize the average character information entropy $\mathcal{L}$. Localization means examining which two basic elements, when merged into a new element, will lead to a decrease in $\mathcal{L}$. We can then iteratively merge to achieve the goal of entropy minimization. "Basic elements" is used here to generalize the problem, as characters can merge into words, words into phrases, and so on. In this iterative process, the "basic elements" change.

"Localization" is the foundation for most subsequent content. Although the derivation process is lengthy, it involves relatively simple operations that can be understood with a bit of thought.

Suppose the current count of element $i$ is $N_i$ and the total count is $N$; then we estimate $p_i = N_i/N$. If element $i$ consists of $l_i$ characters, the current average per-character entropy is:

\[\mathcal{L}=\frac{\mathcal{H}}{l}=\frac{-\sum\limits_i p_i\log p_i}{\sum\limits_i p_i l_i}\tag{2.8}\]

What if we merge two adjacent elements $a,b$ into one? Suppose the frequency of $(a,b)$ is $N_{ab}$, then before merging, we estimate $p_{ab} = N_{ab}/N$. If we treat them as a single "word," the total frequency actually decreases to $\tilde{N} = N - N_{ab}$, and $\tilde{N}_a = N_a - N_{ab}$, $\tilde{N}_b = N_b - N_{ab}$, while other frequencies remain unchanged. Thus, we can re-estimate each frequency as:

\[\begin{aligned}\tilde{p}_{ab}=&\frac{N_{ab}}{\tilde{N}}=\frac{p_{ab}}{1-p_{ab}}\\ \tilde{p}_{a}=&\frac{\tilde{N}_{a}}{\tilde{N}}=\frac{p_a - p_{ab}}{1-p_{ab}},\,\tilde{p}_{b}=\frac{\tilde{N}_{b}}{\tilde{N}}=\frac{p_b - p_{ab}}{1-p_{ab}}\\ \tilde{p}_{i}=&\frac{N_{i}}{\tilde{N}}=\frac{p_i}{1-p_{ab}},\, (i\neq a,b) \end{aligned}\tag{2.9}\]
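As a quick sanity check, the re-estimated probabilities in $(2.9)$ still sum to one:

\[\tilde{p}_{ab}+\tilde{p}_a+\tilde{p}_b+\sum_{i\neq a,b}\tilde{p}_i=\frac{p_{ab}+(p_a-p_{ab})+(p_b-p_{ab})+\sum\limits_{i\neq a,b}p_i}{1-p_{ab}}=\frac{1-p_{ab}}{1-p_{ab}}=1\]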

Hence:

\[\begin{aligned}\tilde{\mathcal{H}}=&-\frac{1}{1-p_{ab}}\Bigg\{p_{ab}\log\Big(\frac{p_{ab}}{1-p_{ab}}\Big) + \\ &\qquad \sum_{i=a,b}(p_i - p_{ab})\log\Big(\frac{p_i - p_{ab}}{1-p_{ab}}\Big)+\sum_{i\neq a,b} p_i\log \Big(\frac{p_i}{1-p_{ab}}\Big)\Bigg\}\\ =&\frac{1}{1-p_{ab}}(\mathcal{H}-\mathcal{F}_{ab})\end{aligned}\tag{2.10}\]

where

\[\begin{aligned}\mathcal{F}_{ab}= &p_{ab}\log \frac{p_{ab}}{p_a p_b} -(1-p_{ab})\log(1-p_{ab}) \\ &\qquad+ \sum_{i=a,b}(p_i-p_{ab})\log\Big(1-\frac{p_{ab}}{p_i}\Big)\end{aligned}\tag{2.11}\]

And

\[\begin{aligned}\tilde{l}=&\frac{p_{ab}}{1-p_{ab}}(l_a + l_b) + \sum_{i=a,b}\frac{p_i - p_{ab}}{1-p_{ab}}l_i + \sum_{i\neq a,b} \frac{p_i}{1-p_{ab}} l_i\\ =&\frac{l}{1-p_{ab}} \end{aligned}\tag{2.12}\]

Therefore:

\[\frac{\tilde{\mathcal{H}}}{\tilde{l}}-\frac{\mathcal{H}}{l}=-\frac{\mathcal{F}_{ab}}{l}\tag{2.13}\]

Our goal is to make $\tilde{\mathcal{H}}/\tilde{l}$ smaller, so clearly, a good "routine" should result in $\mathcal{F}_{ab} \gg 0$.

Simple and elegant approximation

The expression for $\mathcal{F}_{ab}$ is too complex to easily discern patterns, so we can make some approximations. $p_{ab} \leq p_a, p_b$ always holds, and in many cases, we can assume $p_{ab}\ll p_a, p_b$. If we use natural logarithms, then:

\[\begin{aligned}&\ln(1-p_{ab})\approx -p_{ab}\\ &\ln\Big(1-\frac{p_{ab}}{p_i}\Big)\approx -\frac{p_{ab}}{p_i}\end{aligned}\tag{2.14}\]

Because these approximations require the natural logarithm ($\ln(1+x)\approx x$), we will change all subsequent $\log$ to $\ln$. Substituting these into the expression for $\mathcal{F}_{ab}$ and discarding terms of order $p_{ab}^2$ and higher, we get:

\[\mathcal{F}_{ab}\approx F_{ab}^*=p_{ab} \left(\ln \frac{p_{ab}}{p_{a} p_{b}}-1\right)\tag{2.15}\]

This indicator is much simpler and more elegant. $PMI(a, b)=\ln\frac{p_{ab}}{p_{a} p_{b}}$ is what we call pointwise mutual information.

Using Taylor series, a more general expansion can be obtained: \[\mathcal{F}_{ab} = p_{ab} \left(\ln \frac{p_{ab}}{p_{a} p_{b}}-1\right)+\frac{1}{2}\left(\frac{1}{p_a}+\frac{1}{p_b}-1\right)p_{ab}^2+\dots\] It can be seen (and strictly proven) that the approximation $F_{ab}^*$ is always smaller than the true value of $\mathcal{F}_{ab}$. Thus, $F_{ab}^* = p_{ab} \left(\ln \frac{p_{ab}}{p_{a} p_{b}}-1\right)\gg 0$ is actually a sufficient condition for $\mathcal{F}_{ab}\gg 0$.
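For a numerical feel (a small sketch with made-up probabilities, not figures from any corpus), one can compare the exact $\mathcal{F}_{ab}$ of $(2.11)$ with the approximation $F_{ab}^*$ of $(2.15)$ and confirm the lower-bound behaviour:

```python
import math

def F_exact(p_a, p_b, p_ab):
    """Exact F_ab from equation (2.11), natural logarithms."""
    return (p_ab * math.log(p_ab / (p_a * p_b))
            - (1 - p_ab) * math.log(1 - p_ab)
            + sum((p_i - p_ab) * math.log(1 - p_ab / p_i) for p_i in (p_a, p_b)))

def F_approx(p_a, p_b, p_ab):
    """Approximation F*_ab from equation (2.15)."""
    return p_ab * (math.log(p_ab / (p_a * p_b)) - 1)

# Made-up probabilities for illustration; p_ab may not exceed p_a or p_b.
for p_a, p_b, p_ab in [(0.01, 0.02, 0.005), (0.05, 0.03, 0.001), (0.002, 0.004, 0.0015)]:
    exact, approx = F_exact(p_a, p_b, p_ab), F_approx(p_a, p_b, p_ab)
    assert approx < exact          # F*_ab is a lower bound on F_ab
    print(f"p_ab = {p_ab}: exact = {exact:.6f}, approx = {approx:.6f}")
```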

The above discussion covers the merging of two elements. If merging $k$ basic elements $\boldsymbol{a}=(a_1,\dots,a_k)$, we can similarly derive:

\[\begin{aligned}\mathcal{F}_{\boldsymbol{a}}= &p_{\boldsymbol{a}}\ln \frac{p_{\boldsymbol{a}}}{\prod\limits_{i\in \boldsymbol{a}} p_i} -\big[1-(k-1)p_{\boldsymbol{a}}\big]\ln\big[1-(k-1)p_{\boldsymbol{a}}\big] \\ &\qquad+ \sum_{i\in\boldsymbol{a}}(p_i- p_{\boldsymbol{a}})\ln\Big(1-\frac{ p_{\boldsymbol{a}}}{p_i}\Big)\end{aligned}\tag{2.16}\]

Its approximation formula is:

\[\mathcal{F}_{\boldsymbol{a}}\approx \mathcal{F}_{\boldsymbol{a}}^* = p_{\boldsymbol{a}} \left(\ln \frac{p_{\boldsymbol{a}}}{\prod\limits_{i\in\boldsymbol{a}} p_i}-1\right)\tag{2.17}\]

Derivation ends, word bank appears

Now we see that to minimize the objective $\tilde{\mathcal{H}}/\tilde{l}$, we should have $\mathcal{F}_{ab} \gg 0$. Since a lower bound approximation for $\mathcal{F}_{ab}$ is $F_{ab}^*$, we can use $F_{ab}^* \gg 0$ to determine which elements should be merged. In the context of building a word bank, $F_{ab}^* \gg 0$ tells us which characters should be combined into words.

Based on the characteristics of $F_{ab}^*$, a necessary condition for $F_{ab}^* \gg 0$ is $\ln \frac{p_{ab}}{p_{a} p_{b}} > 1$; that is, the mutual information must at least exceed 1 (we are working with natural logarithms). Under this condition, higher mutual information is better, and higher co-occurrence frequency is also better. This tells us to judge merging from both the co-occurrence frequency and the mutual information perspectives.

When it comes to building a word bank, there is a clever trick: the "inverse application" of $F_{ab}^*$. The criterion tells us that when $F_{ab}^* \gg 0$, the two elements should be merged. Conversely, if $F_{ab}^*$ falls below some threshold $\theta$, shouldn't the pair be split? (This is reverse thinking: instead of deciding what to "merge," we decide what to "split"; hence the name "Prompt Decisions.") This way, we only need the two-element merge criterion to guide a rough segmentation of the corpus, then statistically filter the results to obtain many words.

Thus, we obtain a simple and effective algorithm for building a word bank:

1. Statistics: Count the frequency of each character ($p_a, p_b$) and the co-occurrence frequency of adjacent characters ($p_{ab}$) from a large corpus.

2. Segment: Set a frequency threshold $min\_prob$ and a mutual information threshold $min\_pmi$. Split adjacent characters where $p_{ab} < min\_prob$ or $\ln\frac{p_{ab}}{p_a p_b} < min\_pmi$ ("Prompt Decisions," equivalent to a rough word segmentation).

3. Truncate: After splitting in step 2, count the frequency $p_{w'}$ of each "candidate word" and only keep those where $p_{w'} > min\_prob$.
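The three steps above translate almost directly into code. Below is a minimal sketch (the corpus format, thresholds, and function names are illustrative, not part of the original post), where `sentences` is an iterable of raw text fragments, e.g. the corpus split on punctuation:

```python
import math
from collections import Counter

def build_candidates(sentences, min_prob=1e-5, min_pmi=1.0):
    """Steps 1-3: count, cut at weak links ("Prompt Decisions"), keep frequent fragments."""
    # Step 1: character and adjacent-pair counts
    char_counts, pair_counts, total = Counter(), Counter(), 0
    for s in sentences:
        char_counts.update(s)
        pair_counts.update(s[i:i + 2] for i in range(len(s) - 1))
        total += len(s)
    p = {c: n / total for c, n in char_counts.items()}
    p_pair = {ab: n / total for ab, n in pair_counts.items()}

    # Step 2: split wherever the pair frequency or the mutual information is too low
    candidates = Counter()
    for s in sentences:
        if not s:
            continue
        start = 0
        for i in range(len(s) - 1):
            ab = s[i:i + 2]
            pmi = math.log(p_pair[ab] / (p[s[i]] * p[s[i + 1]])) if ab in p_pair else -math.inf
            if p_pair.get(ab, 0.0) < min_prob or pmi < min_pmi:
                candidates[s[start:i + 1]] += 1
                start = i + 1
        candidates[s[start:]] += 1

    # Step 3: truncate -- keep only sufficiently frequent candidate words
    total_frags = sum(candidates.values())
    return {w: n / total_frags for w, n in candidates.items() if n / total_frags > min_prob}
```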

The dictionary obtained from these three steps still has some redundancy. This shows up as follows: if a unigram segmentation model is built with this dictionary, some "words" in it will never be produced by the segmenter, because they can be covered by their sub-words and the product of the sub-words' probabilities exceeds the probability of the word itself. For example, in experiments, because the mutual information of both "的, 主" (of, main) and "主, 要" (main, essential) is greater than 1, the "word" "的主要" (roughly "the main ... of") appeared in the word bank. However, statistics showed that $p(\text{的主要}) < p(\text{的})p(\text{主要})$, so this word would never be produced by unigram segmentation; keeping it in the word bank only wastes memory. Thus, "的主要" should be removed, and its frequency should be added to the frequencies of its constituent words, "的" and "主要".

Therefore, based on this principle, we can filter out some candidate words:

4. De-redundancy: Arrange the candidate words from longest to shortest by character count. One by one, temporarily remove a candidate word from the word bank, segment it using the remaining words and frequencies, and calculate the mutual information between the original word and its sub-words. According to equation $(2.17)$, if the mutual information is greater than 1, restore the word; otherwise, keep it deleted and update the frequencies of the resulting sub-words.
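Step 4 can be sketched as follows, under the assumption of a helper `segment(word, prob)` that performs unigram segmentation of a string with the current word bank (for instance, the dynamic-programming segmenter discussed later); the criterion is the $k$-element approximation $(2.17)$, i.e. keep the word only if $\ln\frac{p_w}{\prod_i p_{w_i}} > 1$:

```python
import math

def deduplicate(prob, segment):
    """Step 4: drop candidate words that their own sub-words already explain well."""
    for w in sorted(prob, key=len, reverse=True):      # longest candidates first
        if len(w) < 2:
            continue
        p_w = prob.pop(w)                              # temporarily remove the word
        parts = segment(w, prob)                       # segment it with the remaining bank
        if len(parts) < 2 or any(part not in prob for part in parts):
            prob[w] = p_w                              # cannot be decomposed; restore it
            continue
        pmi = math.log(p_w) - sum(math.log(prob[part]) for part in parts)
        if pmi > 1.0:                                  # criterion (2.17): worth keeping
            prob[w] = p_w
        else:                                          # redundant: fold its mass into sub-words
            for part in parts:
                prob[part] += p_w
    return prob
```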

The effect of this "de-redundancy" step is quite significant; it can remove 30% or even over 50% of "unqualified" candidate words. Below are some of the filtered results (100 total):

的研究, 上的, 的一个, 的方法, 是在, 在中国, 的主要, 的一, 性的, 系统的, 是中国, 发展的, 方面的, 的方式, 的社会, 以上的, 传统的, 化的, 我们的, 的功能, 的需要, 的各种, 中国人, 的形成, 问题的, 为一, 物质的, 名的, 产品的, 的数据, 里的, 组织的, 时期的, 来的, 的结构, 的信息, 式的, 好的, 的一些, 的传统, 文化的, 的标准, 的目标, 主要的, 为中国, 和社会, 人们的, 规定的, 的规定, 的支持, 的最大, 在一个, 主义的, 的价值, 所谓的, 国家和, 的主, 的特征, 量的, 能力的, 在中, 世界的, 系统中, 者的, 的空间, 成立的, 的最高, 的方向, 是不, 一般的, 是通过, 建立的, 在他, 的最后, 后来的, 体的, 的重大, 等方面的, 这里的, 力的, 领域的, 人之, 具体的, 的反应, 度的, 明显的, 面的, 的电子, 的一切, 民族的, 的数量, 的发现, 技术和, 早期的, 的发生, 知识的, 是美国, 项目的, 的优势, 在一

It can be observed that, except for "中国人" (Chinese person/people), most of these are indeed what we recognize as "unqualified" words.

PS: The "de-redundancy" step requires some manual rules to down-weight long words, because limited corpora often result in insufficient statistics for long words, leading to overestimated probabilities. Since these are "manual rules," I leave them to the reader's discretion and will not list them explicitly here.

Is it really that simple?

Of course, even with these four steps, the algorithm might seem too simple, potentially causing skepticism regarding its effectiveness. Some readers might think: $\ln\frac{p_{ab}}{p_a p_b} < min\_pmi$ does not imply $\ln\frac{p_{abc}}{p_a p_b p_c} < min\_pmi$. That is, even if the mutual information between two adjacent characters is low, the mutual information among three characters might be high. Isn't splitting based solely on character pairs too arbitrary?

There are indeed such cases. For example, in some corpora, within the word "共和国" (republic), the mutual information between "和" and "国" is low, while the mutual information among the three characters is quite high; similarly, for names like "林心如" (Ruby Lin), the mutual information between "心" and "如" is small. In practice, however, such cases are rare when building Chinese word banks. Unlike English, where the basic units are 26 letters (giving 676 pairs and 17,576 triplets), Chinese has thousands of characters, so character pairs (2-grams) already number in the millions. Statistics on 2-grams are therefore usually sufficient to reflect the combinatorial features of characters. If such errors do occur, it is likely because those words are not significant in the input corpus, in which case we simply accept the error.

Therefore, we generally do not need to discuss merging three or more elements. The adjacent character discrimination algorithm is sufficient for most scenarios. Attempting to improve it would inevitably require introducing third-order or higher-order language models and possibly dynamic programming, which would greatly reduce algorithm efficiency. In other words, while improvement ideas exist, the cost of implementing them is quite high.

Recognizing the first type of routine

Now we can identify some "routines" in natural language—words. The next question is: given a sentence, how do we identify the routines within it? For the routine of "words," this essentially means: once we have a word bank and word frequencies, how do we perform word segmentation?

A new interpretation of unigram segmentation

With the previous discussion, it becomes quite simple. From beginning to end, we look for routines to ease memory burden and reduce the total information content of the corpus. We have:

\[-\sum_{w\in \text{corpus}}\ln p_w = -\sum_{\text{sentence}\in \text{corpus}}\;\sum_{w\in \text{sentence}}\ln p_w\tag{2.18}\]

That is, minimizing the total information content is equivalent to minimizing the information content of each sentence. Thus, for a given sentence, the optimal segmentation—assume it is $w_1, w_2, \dots, w_n$—should minimize the total information content of that sentence:

\[-\sum_{k=1}^n \ln p(w_k)=-\ln\prod_{k=1}^n p(w_k)\tag{2.19}\]

This is actually the unigram word segmentation model—maximizing the product of individual word probabilities—but here we have given it a new interpretation through the minimum entropy principle.

Core dynamic programming

The unigram segmentation model can be solved using dynamic programming. Dynamic programming is one of the core algorithms in NLP; it is a method for finding the optimal path in a graph model that satisfies the Markov property.

The general shortest-path problem has high complexity, but thanks to the Markov assumption, an efficient $\mathcal{O}(n)$ algorithm exists. This algorithm is an instance of dynamic programming and, in this setting, is known as the Viterbi algorithm, proposed by Andrew Viterbi in 1967. Its fundamental idea is: if an optimal path is split into two parts, each part is itself a (locally) optimal path.

Specifically, for shortest-path segmentation, dynamic programming means scanning the sentence from left to right: at the current character, keep only the optimal segmentation ending at that character, together with any "quasi-optimal sequences" passing through it (so as not to cut prematurely and incorrectly).

For example, for the sentence “中外科学名著” (Chinese and foreign scientific classics), first scan “中”. It is the only segmentation result and is saved. The word “中外” (Chinese-foreign) also passes through “中”, so it must also be saved. That is, at the first character, the temporary variables store “中” and “中外”. Second, scan the character “外”. We know “中外” can be split into “中 / 外” or kept as “中外”. Naturally, the latter has higher probability, so “中 / 外” is discarded, leaving “中外”. However, the word “外科” (surgery) also passes through “外”, so we must keep the possibility of “中 / 外科”. Thus, at the second character, the candidates are “中外” and “中 / 外科”.

Next comes “科”. Since the probability of “中外 / 科” is lower than “中 / 外科”, the optimal segmentation at this point is “中 / 外科”. However, the word “科学” (science) also passes through “科”, so we also keep the possibility of “中外 / 科学”. By analogy, upon reaching the final character, one can successfully segment “中外 / 科学 / 名著”. The overall principle is: preserve the optimal plan up to the current character and do not ignore any words passing through the current character (to avoid premature mis-cutting).

Here, the author feels that the importance of dynamic programming in NLP cannot be overstated. Whether you work with traditional machine learning or deep learning, you must master the dynamic programming algorithm: not merely knowing how to use it, but being able to write it completely from scratch (combined with a trie or an Aho–Corasick automaton).
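As an illustration, here is a minimal dynamic-programming segmenter for the unigram model of $(2.19)$. It keeps, for every prefix of the sentence, only the best-scoring segmentation ending there, which is the array form of the "keep the optimal plan through each character" rule above; the toy word bank and its probabilities are made up for the example.

```python
import math

def viterbi_segment(sentence, prob, max_word_len=4):
    """Unigram segmentation: maximize the sum of ln p(w), i.e. minimize (2.19)."""
    n = len(sentence)
    best = [0.0] + [-math.inf] * n        # best[i]: best log-probability of sentence[:i]
    back = [0] * (n + 1)                  # back[i]: start index of the last word in sentence[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            w = sentence[j:i]
            if w in prob and best[j] + math.log(prob[w]) > best[i]:
                best[i] = best[j] + math.log(prob[w])
                back[i] = j
    words, i = [], n                      # recover the optimal path via back-pointers
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return words[::-1]

# Toy word bank; single characters serve as a fallback (probabilities are illustrative)
prob = {"中": 0.02, "外": 0.01, "科": 0.01, "学": 0.02, "名": 0.01, "著": 0.005,
        "中外": 0.002, "外科": 0.001, "科学": 0.004, "名著": 0.002}
print(viterbi_segment("中外科学名著", prob))   # -> ['中外', '科学', '名著']
```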

Decoding the beauty of language

Although we have only taken the preliminary step of unsupervised word bank construction, the implications go far beyond this. Readers might pause to appreciate it: what we have done is quite remarkable. We did not manually compile rigid rules to build a word bank; instead, we identified the objective of average per-character information entropy, attempted to minimize it, and the word bank emerged on its own.

This seems to have the flavor of "reenacting civilization." We are like naive children playing aimlessly for our own entertainment, yet eventually succeeding in decoding the mysteries of language.

Readers who have done the experiments might argue: many of the things produced are not "words" we recognize! My view is that as long as we give the model the ability and direction to evolve, it doesn't really matter what the resulting evolution looks like. For instance, in word segmentation, we don't segment for the sake of segmenting; we segment to prepare for subsequent tasks. If so, why care if the segmented words are "accurate"? As stated earlier, the purpose of segmentation is simply to reduce the task's difficulty. What we should truly care about is the capacity to handle the final task.

This might sound a bit like a cop-out, but in fact, this is where the true flavor of artificial intelligence lies—if a group of truly intelligent robots slowly developed civilization from scratch, why should their results be the same as ours?

Of course, segmentation is only the first step. The root cause of errors is that we only focused on segmentation without performing more complex structural and semantic analysis. We will try to do more work and derive more "routines" later. Readers will gradually discover that the results we eventually obtain do indeed align with linguistic phenomena.