By 苏剑林 | May 30, 2018
In the previous article, "The Principle of Minimum Entropy (II): building a lexicon via 'Decisiveness'", we used the principle of minimum entropy as a starting point for a series of mathematical derivations, eventually arriving at equations $(2.15)$ and $(2.17)$. These tell us that elements with high mutual information should be merged, as this helps reduce "learning difficulty." Using this principle, we achieved unsupervised lexicon generation through adjacent character mutual information.
From characters to words, and from words to phrases, we have been examining whether adjacent elements can be merged into a good "pattern." But why must patterns be adjacent? Of course, they don't have to be. When we learn a language, we learn not only words and phrases but also "fixed collocations"—that is, how words are used together reasonably. This is the manifestation of grammar, which is what this article aims to explore, hoping to eventually achieve a degree of unsupervised syntactic analysis.
Since we are considering linguistic associations across non-adjacent words, I have named this installment "Flying Elephant Across the River" (a Chinese chess idiom: the elephant crossing the river, a move the standard rules forbid), precisely:
The second movement of the "Tome of Patterns" — "Flying Elephant Across the River"
Language Structure
Most people do not really know what grammar is; in their minds there are only "fixed collocations," "set formulas," or, more formally, "templates." In most cases, we speak reasonable sentences by following templates. Different people may use different templates, which constitute personal speaking styles or even "catchphrases."
Sentence Templates
For example, "What is the Y of X" is a simple template. It has specific words like "is," "the," "of," "what," and placeholders X and Y. By simply substituting two nouns for X and Y, we get a grammatically correct sentence (whether it is factual or not is another matter). Many such templates can be listed: "X and Y," "the Y of X," "Can X do Y?", "What are the Ys of X?", "Is it X, Y, or Z?", and so on.

[Figure: Sentence template and nested usage example]
Of course, while we can extract as many templates as possible, a finite number of templates can never cover the ever-changing possibilities of language. What matters more, therefore, is the nested use of templates. For example, in the template "What is the Y of X," X can itself be replaced by the template "A and B," resulting in "What is the Y of (A and B)?" In this way, through nested templates, a vast number of sentences can be obtained.
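To make the nesting concrete, here is a toy sketch (my own illustration, not anything from the original implementation) in which templates are plain strings with "[X]" slots and nesting is just recursive substitution:

```python
def fill(template, *fillers):
    """Fill the "[X]" slots of a template from left to right.
    A filler may itself be the result of filling another template."""
    for f in fillers:
        template = template.replace("[X]", f, 1)
    return template

inner = fill("[X] and [X]", "apples", "bananas")    # "apples and bananas"
outer = fill("What is the [X] of [X]?", "color", inner)
print(outer)  # What is the color of apples and bananas?
```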
Equivalence Classes
Next, once we have the template "What is the Y of X?", how do we know what can be filled in X and Y?
We just said "substitute two nouns," but according to our approach so far, we only know how to build a lexicon; we don't even know what a "noun" is, let alone that a noun should be filled in. In fact, we don't need to know anything in advance. We can extract "equivalence classes" for each candidate position through a large corpus, where the candidate words for X form one equivalence class, the candidates for Y form another, and so on.

[Figure: Sentence templates and the concept of equivalence classes]
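As a minimal illustration of extracting such equivalence classes (my own sketch, assuming tokenized sentences and single-word placeholders; `collect_fillers` is a hypothetical helper):

```python
from collections import defaultdict

def collect_fillers(sentences, template):
    """Match a fixed-length template (a list of tokens, with None marking
    a placeholder) against tokenized sentences; the words landing in each
    placeholder position form one equivalence class."""
    classes = defaultdict(set)
    for sent in sentences:
        if len(sent) == len(template) and all(
                t is None or t == w for t, w in zip(template, sent)):
            for i, (t, w) in enumerate(zip(template, sent)):
                if t is None:
                    classes[i].add(w)
    return dict(classes)

# "[X] 的 [X] 是 什么" against a toy corpus
corpus = [["苹果", "的", "颜色", "是", "什么"],
          ["雪梨", "的", "味道", "是", "什么"]]
print(collect_fillers(corpus, [None, "的", None, "是", "什么"]))
# {0: {'苹果', '雪梨'}, 2: {'颜色', '味道'}}
```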
Of course, such a plan is rather idealized; in reality, the raw corpora we can obtain are in much worse condition. But regardless, "even a skyscraper starts from the ground": we first solve the ideal case, and consider the general case during actual use.
Below we will explore, one by one, how to obtain sentence templates from large amounts of raw corpora, how to recognize the sentence templates used in sentences, and even how to mine the hierarchical structure of sentences.
Generating Templates
In fact, with the experience of building a lexicon from the previous article, it is not difficult to conceive an algorithm for generating sentence templates.
In lexicon construction, our statistical units were characters; now, our statistical units are words. Furthermore, words are composed of adjacent characters, but sentence templates are not necessarily composed of adjacent words (otherwise they would degenerate into words or phrases). Therefore, we also need to count cross-word co-occurrences, in the spirit of the Skip-Gram model used in Word2Vec.
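For instance, the cross-word pair counting might look like this (a toy sketch; `skip_pairs` and the example corpus are my own illustration):

```python
from collections import Counter

def skip_pairs(sent, window=5):
    """Yield ordered word pairs (a, b) with b at most `window` words
    after a -- Skip-Gram-style cross-word co-occurrence."""
    for i, a in enumerate(sent):
        for b in sent[i + 1 : i + 1 + window]:
            yield (a, b)

corpus = [["电脑", "的", "内存", "怎么", "升级"]]  # toy tokenized corpus
pair_counts = Counter(p for s in corpus for p in skip_pairs(s))
print(pair_counts[("电脑", "内存")])  # 1
```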
Directed Acyclic Graph
A Directed Acyclic Graph (DAG) is a graph theory model frequently encountered in NLP. In fact, the unigram word segmentation model can be abstractly viewed as a shortest path planning problem on a DAG. The construction of the candidate template set here also requires the use of a DAG.
Because we consider the Skip-Gram model, we can connect "word pairs" within a sentence that are relatively "tight" (high mutual information). From a graph theory perspective, this forms a "Directed Acyclic Graph":

[Figure: Directed acyclic graph formed by a sentence and its Skip-Gram relationships]
We extract all paths on the graph; wherever a path skips over intermediate words, we insert a placeholder (denoted as X below). This generates the candidate template set. For example, candidate templates extracted from the graph above might include:
Computer X Mouse X
X's X has what X's X?
X's X has what X?
X is relatively X
Algorithm Steps
We can specifically describe the above process as follows:
1. Segment the corpus into sentences and tokenize them;
2. Choose a window size $window$; count the individual word frequencies ($p_a$, $p_b$) and the co-occurrence frequency ($p_{ab}$) of any two words appearing within the window of each other in the corpus;
3. Set a threshold for occurrence frequency $min\_prob$ and a threshold for mutual information $min\_pmi$;
4. Traverse all sentences:
4.1. Construct a graph for each sentence, where words in the sentence are nodes;
4.2. For word pairs $(a, b)$ within the window in the sentence, if $p_{ab} > min\_prob$ and $\ln\frac{p_{ab}}{p_a p_b} > min\_pmi$, add a directed edge "a --> b" to the graph;
4.3. Find all paths on the graph (isolated points also count as paths) and add them to the statistics as candidate templates.
5. Count the frequency of each candidate template, sort them in descending order of frequency, and keep the top portion (see the sketch below).
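The sketch below strings steps 1–5 together (my own illustration under simplifying assumptions: the input is already split into sentences and tokenized, a placeholder may absorb any number of words, and no pruning is done; it is not the author's released code):

```python
import math
from collections import Counter

def extract_templates(sentences, window=5, min_prob=1e-6, min_pmi=1.0):
    """Toy sketch of steps 1-5; `sentences` are tokenized word lists."""
    # Step 2: unigram and windowed co-occurrence counts -> frequencies.
    uni, pair, n_tok, n_pair = Counter(), Counter(), 0, 0
    for sent in sentences:
        n_tok += len(sent)
        uni.update(sent)
        for i, a in enumerate(sent):
            for b in sent[i + 1 : i + 1 + window]:
                pair[(a, b)] += 1
                n_pair += 1
    templates = Counter()
    for sent in sentences:
        # Step 4.2: edge i -> j when frequency and PMI pass the thresholds.
        edges = {i: [] for i in range(len(sent))}
        for i in range(len(sent)):
            for j in range(i + 1, min(i + 1 + window, len(sent))):
                p_ab = pair[(sent[i], sent[j])] / n_pair
                p_a, p_b = uni[sent[i]] / n_tok, uni[sent[j]] / n_tok
                if p_ab > min_prob and math.log(p_ab / (p_a * p_b)) > min_pmi:
                    edges[i].append(j)
        # Step 4.3: every path is a candidate; [X] marks skipped words.
        def record(path):
            tpl = ["[X]"] if path[0] > 0 else []
            for k, i in enumerate(path):
                if k and i - path[k - 1] > 1:
                    tpl.append("[X]")
                tpl.append(sent[i])
            if path[-1] < len(sent) - 1:
                tpl.append("[X]")
            templates[" ".join(tpl)] += 1
            for j in edges[path[-1]]:
                record(path + [j])
        for i in range(len(sent)):
            record([i])
    # Step 5: candidates sorted by descending frequency.
    return templates.most_common()
```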
This algorithm can be used for extracting sentence templates and also simply for extracting phrases (collocations) by setting the window to 1. Therefore, it essentially encompasses the lexicon construction mentioned in the previous article; thus, the algorithm above is a generalized extraction framework.
Effect Demonstration
Below are some sentence templates extracted from the Baidu Zhidao question set (the numbers are the calculated frequencies, which can be ignored):
< Template: [X] 的 [X] > 20199
< Template: [X] 吗 > 9695
< Template: [X] 是 [X] > 5358
< Template: [X] 的 > 3979
< Template: [X] 和 [X] > 3919
< Template: 我 [X] > 3766
< Template: [X] 有 [X] > 3568
< Template: [X] 了 > 2910
< Template: [X] 了 [X] > 2702
< Template: [X] 怎么 [X] > 2340
< Template: [X] 到 [X] > 2254
< Template: [X] 在 [X] > 2234
< Template: [X] 我 [X] > 2147
< Template: [X] 不 [X] > 1708
< Template: 求 [X] > 1547
< Template: [X] 怎么样 > 1371
Note that templates like "[X] 的 [X]" and "[X] 怎么 [X]," where a word is sandwiched between two placeholders, are trivial; they simply tell us that the word can be inserted into a sentence. Therefore, to see the effect more clearly, we exclude this type of template and get:
< Template: [X] 吗 > 9695
< Template: [X] 的 > 3979
< Template: 我 [X] > 3766
< Template: [X] 了 > 2910
< Template: 求 [X] > 1547
< Template: [X] 怎么样 > 1371
< Template: [X] 啊 > 1324
< Template: 有 [X] > 1319
< Template: [X] 怎么办 > 1232
< Template: 为什么 [X] > 1220
< Template: 请问 [X] > 1189
< Template: [X] 呢 > 1099
< Template: [X] 么 > 1003
< Template: 谢谢 > 997
< Template: 怎么 [X] > 903
< Template: 在 [X] > 894
< Template: 现在 [X] > 874
< Template: 如何 [X] > 867
< Template: [X] 好 > 798
< Template: [X] 是 [X] 意思 > 728
< Template: 是 [X] > 727
< Template: [X] 是什么意思 > 721
< Template: [X] 是什么 > 684
< Template: 怎么办 > 589
< Template: 有没有 [X] > 582
< Template: [X] 多少钱 > 550
< Template: 从 [X] > 526
< Template: 什么 [X] > 522
< Template: [X] 有哪些 > 490
< Template: [X] 是什么 [X] > 483
From the results, our sentence template generation is indeed effective, as these templates help us discover rules of language usage. For example:
1. Templates like "[X] 吗", "[X] 了", "[X] 怎么样" have placeholders at the beginning, indicating these words can be placed at the end of a question (the corpus used consists of questions).
2. Templates like "我 [X]", "求 [X]", "为什么 [X]", "请问 [X]" have placeholders at the end, indicating these words can be placed at the beginning of a question.
3. Templates like "谢谢" and "怎么办" have no placeholders, indicating they can stand alone as sentences.
4. Templates like "[X] 是 [X] 意思" and "[X] 是什么 [X]" reflect fixed collocations in the language.
From a general perspective, these templates describe syntactic-level linguistic phenomena. Of course, to avoid confusion with current mainstream syntactic analysis, we might as well call them language structural patterns, or directly "sentence templates."
Structural Analysis
Just like word segmentation, once sentence templates are constructed, we need an algorithm to identify which templates are used in a sentence. Only by achieving this step can we identify equivalence classes of words from the corpus.
Looking back at the word segmentation algorithm, word segmentation is just a problem of sentence partitioning. The words partitioned out have no "holes" (placeholders). However, if we want to identify which templates are used in a sentence, templates have "holes" and can be nested, which makes identification difficult. Yet, once we can accomplish this, we obtain a hierarchical structural decomposition of the sentence, which is a very attractive goal.
Projectivity Hypothesis
To implement hierarchical decomposition of sentences, we can first borrow the "projectivity hypothesis" commonly used in syntactic analysis.
Projectivity in language roughly means that if a sentence can be divided into several "semantic blocks," these blocks do not overlap. That is, if words 1, 2, and 3 form one semantic block, and words 4 and 5 form another, this is allowed. But if words 1, 2, and 4 form a semantic block and words 3 and 5 form another, this is impossible. Most languages, including Chinese and English, basically satisfy projectivity.
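A toy check of this assumption (my own illustration), with semantic blocks given as sets of word positions:

```python
def projective(blocks):
    """Check projectivity: every semantic block is a contiguous span,
    and any two blocks are either disjoint or nested."""
    spans = [(min(b), max(b)) for b in blocks]
    for b, (lo, hi) in zip(blocks, spans):
        if sorted(b) != list(range(lo, hi + 1)):
            return False            # block has a gap: not contiguous
    for lo1, hi1 in spans:
        for lo2, hi2 in spans:
            if lo1 < lo2 <= hi1 < hi2:
                return False        # overlapping but not nested
    return True

print(projective([{1, 2, 3}, {4, 5}]))  # True: blocks do not interleave
print(projective([{1, 2, 4}, {3, 5}]))  # False: blocks interleave
```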
Structural Hypothesis
To complete the hierarchical structural decomposition of a sentence, we need to make a more complete hypothesis about the compositional structure of the sentence. Inspired by the projectivity hypothesis, I believe we can make the following assumptions about sentence structure:
1. Each semantic block is a continuous substring of the sentence, and the sentence itself is also a semantic block;
2. Each semantic block is generated by a primary sentence template, where the placeholder parts of the template are also semantic blocks;
3. Each individual word can be viewed as a trivial sentence template and also as a semantic block of the smallest granularity.
In short, these three assumptions can be summarized in one sentence:
Every sentence is generated by the mutual nesting of sentence templates.
At first glance, this assumption may seem unreasonable, but careful thought reveals it is sufficient to describe the structure of most sentences. Readers might wonder: "Is it possible for two sentence templates to be used in parallel rather than nested?" The answer is: probably not, because if that happens, the "parallelism" itself can be viewed as a template. For example, treating "X and X" as a template makes its two semantic blocks parallel; it can even be nested with itself, as in "X and (X and X)," to describe longer parallel constructions.
Precisely because we made this assumption about language structure, once we identify the optimal sentence template combination for a sentence, we obtain its hierarchical structure—because per the assumption, templates are combined via nesting, nesting implies recursion, and recursion is a tree-like hierarchical structure.
Decomposition Algorithm
With the structural hypothesis, we can describe the template recognition algorithm. First, let's restate the word segmentation algorithm. The idea of unigram word segmentation is:
Segment a sentence into words such that the sum of the logarithms of their probabilities is maximized (the sum of information is minimized).
It can also be expressed as follows:
Find a series of words to cover every character in the sentence without repetition or omission, such that the sum of the logarithms of their probabilities is maximized (the sum of information is minimized).
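As a concrete reference point, here is a minimal dynamic-programming sketch of this unigram criterion (my own illustration; the `logp` dictionary, the unknown-character penalty, and the `max_len` cap are all assumptions):

```python
import math

def unigram_segment(sentence, logp, max_len=10):
    """Cover the sentence with words so that the total log-probability
    is maximal. `logp` maps known words to log-probabilities; unknown
    single characters get a severe penalty so coverage never fails."""
    n = len(sentence)
    best = [0.0] + [-math.inf] * n     # best[i]: best score of sentence[:i]
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            w = sentence[j:i]
            s = logp.get(w, -100.0 if len(w) == 1 else -math.inf)
            if best[j] + s > best[i]:
                best[i], back[i] = best[j] + s, j
    words, i = [], n                   # recover the covering, right to left
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return words[::-1]

print(unigram_segment("苹果的颜色", {"苹果": -3.0, "颜色": -3.5, "的": -1.0}))
# ['苹果', '的', '颜色']
```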
Previously, we thought of segmentation as cutting the sentence; this equivalent expression reverses it, requiring coverage of the sentence. With this reverse thinking, we can propose a template recognition algorithm:
Find a series of sentence templates to cover every word in the sentence without repetition, omission, or overlap, such that the sum of the logarithms of their probabilities is maximized (the sum of information is minimized).
Of course, this is just the idea. In the implementation, the main difficulty is handling placeholders—that is, each word in the sentence represents both itself and a potential placeholder. This duality makes scanning and recognition difficult. Fortunately, following the structural assumptions above, we can transform it into a recursive calculation:
In the optimal structural decomposition scheme, the decomposition scheme for each semantic block under the main template is also optimal.

[Figure: Hierarchical structure analysis of a sentence, containing nested template calls]
Thus, we get the algorithm:
1. Scan the sentence for all templates that could possibly appear in it (a Trie structure makes this fast);
2. Recursively score the decomposition schemes: the score of a scheme equals the score of the sentence's main template plus the scores of the optimal decomposition schemes of its semantic blocks; take the scheme with the highest score. A toy implementation is sketched below.
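Here is a toy recursive implementation of this idea (my own illustration: templates are tuples with `None` as the placeholder, each placeholder absorbs at least one word, and `lru_cache` memoization stands in for a proper chart; a real implementation would use the Trie scan from step 1 and much tighter pruning):

```python
import math
from functools import lru_cache

def best_parse(sent, templates):
    """`templates` maps a template (tuple of tokens, None = placeholder)
    to a log-probability score. Returns (score, tree), where a tree is
    a word or a (template, sub-trees) pair."""
    sent = tuple(sent)

    def match(tpl, lo, hi):
        """Yield the placeholder spans for each way tpl covers sent[lo:hi]
        exactly; each placeholder must absorb at least one word."""
        if not tpl:
            if lo == hi:
                yield []
            return
        head, rest = tpl[0], tpl[1:]
        if head is None:                    # placeholder: try every span
            for mid in range(lo + 1, hi + 1):
                for blocks in match(rest, mid, hi):
                    yield [(lo, mid)] + blocks
        elif lo < hi and sent[lo] == head:  # literal token must match
            yield from match(rest, lo + 1, hi)

    @lru_cache(maxsize=None)
    def parse(lo, hi):
        if hi - lo == 1:                    # single word: trivial template
            return templates.get((sent[lo],), -20.0), sent[lo]
        best = (-math.inf, None)
        for tpl, lp in templates.items():
            for blocks in match(tpl, lo, hi):
                score, kids = lp, []
                for b in blocks:            # each block parsed optimally
                    s, t = parse(*b)
                    score, kids = score + s, kids + [t]
                if score > best[0]:
                    best = (score, (tpl, kids))
        return best

    return parse(0, len(sent))

tpls = {(None, "吗"): -2.0, (None, "可以", None): -2.5,
        ("吃",): -4.0, ("鸡蛋",): -4.0}
print(best_parse(["鸡蛋", "可以", "吃", "吗"], tpls))
# e.g. (-12.5, ((None, '吗'), [((None, '可以', None), ['鸡蛋', '吃'])]))
```

The recursion mirrors the structural hypothesis directly: a span is either a single word or one template whose placeholder blocks are parsed optimally in turn, which matches the example "(鸡蛋)可以(吃)吗" shown next.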
Result Display
Below are demonstrations of some simple examples analyzed using a limited set of templates. As we can see, hierarchical structural analysis of sentences has indeed been preliminarily realized.
+---> (鸡蛋)可以(吃)吗
| +---> 鸡蛋
| | +---> 鸡蛋
| +---> 可以
| +---> 吃
| | +---> 吃
| +---> 吗
+---> (牛肉鸡蛋)可以(吃)吗
| +---> 牛肉鸡蛋
| | +---> 牛肉
| | +---> 鸡蛋
| +---> 可以
| +---> 吃
| | +---> 吃
| +---> 吗
+---> (苹果)的(颜色)是(什么)呢
| +---> 苹果
| | +---> 苹果
| +---> 的
| +---> 颜色
| | +---> 颜色
| +---> 是
| +---> 什么
| | +---> 什么
| +---> 呢
+---> (雪梨和苹果和香蕉)的(颜色)是(什么)呢
| +---> (雪梨和苹果)和(香蕉)
| | +---> (雪梨)和(苹果)
| | | +---> 雪梨
| | | | +---> 雪梨
| | | +---> 和
| | | +---> 苹果
| | | | +---> 苹果
| | +---> 和
| | +---> 香蕉
| | | +---> 香蕉
| +---> 的
| +---> 颜色
| | +---> 颜色
| +---> 是
| +---> 什么
| | +---> 什么
| +---> 呢
Of course, we cannot only report good news; here are some failed examples:
+---> (我的美味)的(苹果的颜色)是(什么)呢
| +---> (我)的(美味)
| | +---> 我
| | | +---> 我
| | +---> 的
| | +---> 美味
| | | +---> 美味
| +---> 的
| +---> (苹果)的(颜色)
| | +---> 苹果
| | | +---> 苹果
| | +---> 的
| | +---> 颜色
| | | +---> 颜色
| +---> 是
| +---> 什么
| | +---> 什么
| +---> 呢
+---> (苹果)的(颜色)是(什么的意思是什么)呢
| +---> 苹果
| | +---> 苹果
| +---> 的
| +---> 颜色
| | +---> 颜色
| +---> 是
| +---> (什么)的(意思)是(什么)
| | +---> 什么
| | | +---> 什么
| | +---> 的
| | +---> 意思
| | | +---> 意思
| | +---> 是
| | +---> 什么
| | | +---> 什么
| +---> 呢
We will analyze the failed examples later.
Summary
For those feeling completely confused or having many things to complain about, please read this section first~
Jigsaw Puzzle
From words to phrases to sentence templates, we have been playing a jigsaw puzzle: whenever we find that two pieces fit together well, we merge them, because merging items with high mutual information into a whole reduces the overall information entropy, which in turn reduces the overall learning difficulty.
If sentence templates don't make sense in the context of Chinese, just look back at how we learned English in primary or middle school; back then, we probably learned many English sentence templates~
What Is It For?
"Sentence template" is a new concept proposed in this article, and using it to recognize language structures is a new attempt. Readers might ask: what is this for?
I think the best way to answer this is to quote Isaac Newton:
I do not know what I may appear to the world, but to myself I seem to have been only like a boy playing on the sea-shore, and diverting myself in now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered before me.
I cited this to show that the fundamental reason for this exploration is not for some practical purpose, but to purely explore the mysteries of natural language.
Of course, if at the same time the research results possess some application value, that would be even more perfect. From current results, such value may exist. In NLP, we face ever-changing sentences, but in fact, "sentence styles" are finite, which means sentence templates are also finite. If necessary, we can manually annotate the meanings of the placeholders in each sentence template, which would allow us to map sentence template structures to conventional syntactic descriptions. Decomposing (infinite) sentences through finite templates can make NLP scenarios more flexible.
Similar things might have appeared in traditional natural language processing, but the content described here is purely the result of unsupervised discovery, has a self-consistent theoretical description, and the preliminary results are acceptable. Therefore, it is worth further thinking about its application value.
Moving Forward with Difficulty
After reading this article, the reader's biggest feeling might be "completely confused": can it be simplified a bit more?
To answer this, it should be mentioned that more than a month passed between the previous article in this series and this one, which seems like a long time. In form, this article is merely a simple extension of the previous one: doesn't it just extend adjacent associations to non-adjacent ones?
Indeed, that is true in form. But extending this idea into something of both theoretical and practical value is neither simple nor smooth. For example, in template generation, obtaining all candidate templates without omission is a difficult problem; and once sentence templates are obtained (whether automatically or manually), identifying the templates used within a sentence is harder still. Both the theoretical reasoning and the programming implementation involve many obstacles and require a clear grasp of tree structures and recursive programming. I debugged for over half a month before settling on the whole pipeline, and it may still be incomplete.
So, it is perfectly normal to feel confused. Even after completing the work and writing this article, I still feel quite confused~
Ideas for Improvement
In the Result Display section, we also presented some failed examples. In fact, there might be even more.
We should look at this from two perspectives. On the one hand, we have successful examples; exploring pure unsupervised mining is worthwhile even if we only get a small portion of successful results. On the other hand, for failed examples, we need to think about the reasons for failure and consider solutions.
I believe the overall idea of sentence templates is correct, but the issue lies in the fact that we have not reached true semantic-level understanding. For example, in the first failed example, the result was
(my deliciousness) of (the apple's color) is (what) — a literal gloss of the mis-grouped parse shown above.
Can we say this decomposition is completely wrong? No. Strictly speaking, this decomposition is not grammatically incorrect; it just doesn't align with semantics or our common sense. Therefore, the error lies not with the sentence templates themselves, but with the inability to fully integrate semantics to build templates.
Looking at current mainstream syntactic analysis work, whether supervised or unsupervised, they basically combine "part-of-speech" to complete the analysis. This provides us with a direction: the next step in the minimum entropy series is to explore the problem of word clustering to better capture word meaning and language commonality.