By 苏剑林 | April 18, 2018
In today's world, where end-to-end schemes like deep learning have gradually swept through NLP, are you still willing to think about the fundamental principles behind natural language? We often use the term "text mining"—do you truly feel the sense of "mining"?
A while ago, I read an article on unsupervised syntactic analysis. Subsequently, from its references, I discovered the paper "Redundancy Reduction as a Strategy for Unsupervised Learning". This paper describes how to recover English words from English text with spaces removed. When applied to Chinese, isn't this exactly the construction of a lexicon? So, I read it with great interest and found the paper's reasoning clear, the theory complete, and the results beautiful; it was a delight to read.
Although the value of this paper may not seem very high today, and its results might have already been studied by many, it is important to note: this is a paper from 1993! In an era before PCs became popular, such forward-looking research was conducted. While deep learning is popular now and NLP tasks are becoming increasingly complex—which is indeed a great step forward—our understanding of the true principles of NLP may not have surpassed our predecessors from decades ago by much.
This paper implements unsupervised lexicon construction through "redundancy reduction." From an information theory perspective, "redundancy reduction" is the minimization of information entropy. The article on unsupervised syntactic analysis also pointed out that "information entropy minimization is the only viable scheme for unsupervised NLP." After studying some related materials and combining them with my own understanding, I find this comment quite thought-provoking. In fact, I feel that information entropy minimization is likely the root not just of unsupervised NLP but of all unsupervised learning.
Readers may have heard of the Maximum Entropy Principle and the Maximum Entropy Model. What is this Minimum Entropy Principle? Is it not contradictory to Maximum Entropy?
We know that entropy is a measure of uncertainty. The Maximum Entropy Principle means that when we make inferences about results, we must acknowledge our ignorance; therefore, we should maximize uncertainty to obtain the most objective results. As for the Minimum Entropy Principle, we have two perspectives of understanding:
1. An intuitive understanding: The process of civilizational evolution is always a process of exploration and discovery. Through our efforts, more and more things change from uncertain to certain, and entropy gradually tends toward minimization. Therefore, to discover hidden patterns from a pile of raw data (to reenact civilization), we must see whether these patterns help reduce the overall information entropy, because this represents the direction of civilizational evolution. This is the "Minimum Entropy Principle."
2. A more rigorous understanding: "Knowledge" has an inherent information entropy, representing its essential information content. But before we fully understand it, there are always unknown factors, which cause us to have redundancy when expressing it. Therefore, estimating information entropy according to our current understanding actually yields an upper bound on the inherent information entropy. Minimizing information entropy means finding ways to lower this upper bound, which implies reducing unknowns and approaching the inherent information entropy.
Thus, following the path of the "Minimum Entropy Principle," I have re-organized previous works and made some new extensions, resulting in these texts. Readers will gradually see that the Minimum Entropy Principle can be used in a highly explanatory and enlightening way to derive rich results.
Let us begin by examining the information entropy of language and slowly enter the world of this Minimum Entropy Principle~
From "Can't Afford 'Entropy': From Entropy, Maximum Entropy Principle to Maximum Entropy Model (I)", we know that the information entropy of an object is proportional to the negative logarithm of its probability, which is: \[I(c)\sim -\log p_c\tag{1.1}\] If we consider the character to be the basic unit of Chinese, then Chinese is a combination of characters. $p_c$ represents the probability of the corresponding character, and $-\log p_c$ is the amount of information in that character. We can estimate the average information of each Chinese character through a large corpus: \[\mathcal{H}_c = -\sum_{c\in\text{Chinese Characters}} p_c\log p_c\tag{1.2}\] If the $\log$ is base 2, then according to data circulating online, this value is about 9.65 bits (I have counted some articles myself and obtained a value of about 9.5; the two are comparable). Similarly, the average information of each letter in English is about 4.03 bits.
What does this number mean? Roughly speaking, we can assume that the rate at which we receive or memorize information is fixed. The amount of information therefore corresponds to the time (or effort, etc.) we need to take it in, so this number reflects how hard the content is to learn (the memory load). For example, if we could only receive 1 bit of information per second, then memorizing an 800-character article character by character would take about $9.65 \times 800$ seconds.
Since the average information of a single Chinese character is 9.65 bits, while that of an English letter is only 4.03 bits, does this mean English is a more efficient form of expression?
Obviously, such a conclusion cannot be made so rashly. Is it necessarily easier to memorize an English essay than a Chinese essay?
For example, an 800-character Chinese essay translated into English might come to about 500 words. If each English word averages 4 letters, then the total information is $4.03 \times 500 \times 4 \approx 8060$ bits, which is close to $9.65 \times 800 = 7720$ bits. In other words, comparing the information content of individual language units is meaningless; what matters is the total amount of information, that is, which language is more concise when expressing the same meaning.
When the meaning of two sentences is the same, the inherent information volume of this "meaning" is constant. However, when expressed in different languages, "redundancy" is inevitably introduced. Therefore, the information volume expressed in different languages varies. This information volume is actually equivalent to the memory load; the more redundant a language is, the larger its information volume and the greater the memory load. It is just like teaching the same course: some teachers teach clearly and concisely, and students understand easily; some teachers are wordy and repetitive, and students struggle to learn. For exactly the same course, the amount of knowledge is essentially the same, but a poorly taught lesson indicates that too much irrelevant information was introduced during the teaching process—this is "redundancy," so we must find ways to "reduce redundancy."
The estimates above for Chinese and English are comparable, which suggests that both languages have undergone long-term optimization and reached a fairly refined state; there is no basis for claiming that one is clearly superior to the other.
Note that in the above estimations, we emphasized "memorizing character by character." Perhaps we are too familiar with Chinese to realize what this implies; in fact, it represents a very mechanical method of memorization, which is not how we actually do it.
Recall the scenes of our childhood when we recited ancient poems and classical Chinese. At first we recited without understanding, swallowing them whole: we knew every character but not what they meant when strung together. This is the so-called "reading character by character," and memorizing this way is obviously very hard. Later, we slowly began to figure out the patterns of classical Chinese writing and could gradually understand the meaning of ancient poems and prose, and memorizing became easier. In high school we also learned grammatical rules of classical Chinese such as "object fronting" and "postpositional modifiers," which helped our memory and understanding a great deal.
Here is the key point!
From the example of classical Chinese, we can see that memorizing character by character, like chanting scripture, is very difficult; it becomes easier once we group characters into words and understand them, and easier still if we can find certain grammatical rules. Yet the rate at which we receive (memorize) information is still fixed. This means that steps such as word segmentation and grammar reduce the information volume of the language, and thereby reduce our learning cost!
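To make this concrete, here is a toy Python sketch comparing the total information of the same text when treated character by character versus word by word, each under a simple unigram estimate. The English sample and the whitespace split are stand-ins for a Chinese corpus and a word segmenter, and the comparison ignores the cost of storing the lexicon itself, so it only illustrates the direction of the effect.

```python
import math
from collections import Counter

def total_bits(units):
    """Total information of a sequence under a unigram model estimated
    from the sequence itself: sum over tokens of -log2 p(token)."""
    counts = Counter(units)
    n = len(units)
    return -sum(c * math.log2(c / n) for c in counts.values())

# Toy example; for Chinese, `words` would come from a word segmenter.
text = "the cat sat on the mat the cat ate the rat"
chars = [c for c in text if c != " "]   # character-by-character view
words = text.split()                    # word-by-word view

print(f"character level: {total_bits(chars):.1f} bits")
print(f"word level:      {total_bits(words):.1f} bits")
```

On this toy sample the word-level total comes out far smaller than the character-level total, which is exactly the sense in which segmentation "reduces redundancy."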
Thinking further, it is not just language; learning anything works this way. If there is only a little to learn, we can simply memorize it by brute force. But when there is a lot, we try to find the "patterns" (rely on routines). For example, Chinese chess is divided into the opening, middle game, and endgame, and each stage has many "standard patterns" (book moves), which lower the learning difficulty for beginners and serve as a basis for adapting to complex situations. Likewise, we have "The Art of War" and "The Thirty-Six Stratagems," which are "manuals of patterns." Mining "patterns" alleviates the burden of memorizing individual cases; the emergence of a "pattern" is itself a process of reducing the volume of information.
To put it simply, if you chant a scripture enough times, you can discover the patterns within it.
In a nutshell, the rate at which we receive information is fixed, so the only way to accelerate our learning progress is to reduce the redundant information of the learning target. The so-called "discarding the dross and keeping the essence" is the Minimum Entropy Principle in NLP, which is the "redundancy reduction" mentioned at the beginning. We can understand it as "saving unnecessary learning costs."
In fact, an efficient learning process must reflect this idea. Similarly, teachers design their teaching plans based on this idea. When teaching, teachers are more inclined to teach "general methods" (even if they have more steps) rather than choosing to teach a unique and clever solution for every single problem. When preparing for the GaoKao (college entrance exam), we work hard to figure out various question patterns and problem-solving patterns. These are all processes of reducing information entropy through mining "patterns," thereby lowering learning costs. "Patterns" are the methods for "redundancy reduction."
A "pattern" is a "fixed form." Only with enough patterns can we respond to all changes with the unchanging. As the saying goes, "ten thousand changes do not leave the source"—this "source" must be the pattern. When there are too many "patterns," we further look for "meta-patterns"—patterns of patterns—to reduce our burden of memorizing patterns. This is a progressive process. It seems that elevating individual phenomena into patterns is precisely the embodiment of human intelligence~
Well, enough empty talk. Next, we will formally embark on the journey of mastering "patterns."