The Minimum Entropy Principle (V): "Step by Step" Community Detection and Clustering

By 苏剑林 | October 19, 2019

Let us tirelessly review: the Minimum Entropy Principle is an unsupervised learning principle. "Entropy" represents the learning cost, and reducing this cost is our constant pursuit. By "minimizing the learning cost," we can unsupervisedly learn results that align with our intuition. This is the basic philosophy of the Minimum Entropy Principle.

In this article, we will introduce a quite beautiful clustering algorithm that also embodies the Minimum Entropy Principle, or can be derived from it. It is called InfoMap, or MapEquation. In fact, InfoMap is a result from 2007, with the earliest paper being "Maps of random walks on complex networks reveal community structure". Although it may seem old, I believe it remains the most beautiful clustering algorithm because it doesn't just tell us "how to cluster," but more importantly, it provides an elegant information-theoretic explanation of "why we cluster," and directly derives the entire clustering process from this explanation.

Complex directed network schematic
A schematic diagram of a complex directed graph network. Image from the original InfoMap paper "Maps of random walks on complex networks reveal community structure".

Of course, its positioning is not limited to clustering; more accurately, it is a "Community Detection" algorithm on graph networks. Roughly speaking, Community Detection means: given a directed or undirected graph, find the "clumping" or "grouping" behavior within it. You can search for the precise definition yourself; simply put, it is similar to clustering but carries a richer meaning (see also "What is Community Detection?").

Entropy and Coding

In previous articles we worked with the concept of information entropy. Starting from this article, we introduce an equivalent notion: the average code length. Introducing it helps us understand and construct the minimum entropy objective more precisely.

Binary Tree Coding

Binary Tree Coding
No code is a prefix of any other code, so these codes can be arranged as the leaves of a binary tree.

Coding means representing the original information with combinations of a finite set of symbols. The most typical case is binary coding, which uses only the digits 0 and 1. Codes and original objects usually stand in one-to-one correspondence; for example, the two characters of the word "Science" might correspond to the codes 11 and 1001, respectively.

Note that we consider static coding here, meaning each object corresponds to one code. We also consider prefix-free codes (delimiter-free), meaning no additional separator is needed to decode individual objects. This simply requires that no code is a prefix of any other code (so that we can read code continuously until an object is identified, then start a new read). This implies that all codes can be constructed into a binary tree where each code corresponds to a leaf, as shown in the figure.

On this basis, we can get an interesting conclusion: assuming $n$ different characters with code lengths $l_1, l_2, \dots, l_n$, we have:

\begin{equation} \sum_{i=1}^n 2^{-l_i}\leq 1 \label{eq:leq} \end{equation}

This is a direct corollary of this binary tree representation; readers can try to prove it themselves.
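To make this concrete, here is a small sanity check in Python (the code table is made up for illustration): it verifies both the prefix-free property and the inequality above for a toy set of codes.

```python
from itertools import combinations

# A toy prefix-free code table (made up for illustration).
codes = ["0", "10", "110", "111"]

# Prefix-free check: no code is a prefix of another.
prefix_free = all(
    not a.startswith(b) and not b.startswith(a)
    for a, b in combinations(codes, 2)
)

# The inequality: the sum of 2^{-l_i} over all code lengths must not exceed 1.
kraft_sum = sum(2 ** -len(c) for c in codes)

print(prefix_free)  # True
print(kraft_sum)    # 1.0, consistent with the inequality
```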

Shortest Coding Length

Imagine a "shorthand" scenario where we need to quickly record text. Encoding involves mapping each character to a binary code. To record faster, we obviously want frequently occurring characters to have shorter codes (e.g., the word "the" in English). If you used a long string like 11000010001 for "the", recording would be slow. Using something like 10 would be much faster.

Suppose there are $n$ different characters with occurrence probabilities $P=(p_1, p_2, \dots, p_n)$. Two questions arise: First, how to find the optimal coding scheme for the shortest average length? Second, what is the theoretical minimum average length?

For the first question, the answer is Huffman coding (the same as in Huffman Softmax from Word2Vec), though that's not our focus. The answer to the second question is a fundamental result in information theory: Information Entropy.

\begin{equation} H(P)=-\sum_{i=1}^n p_i \log_2 p_i \label{eq:l} \end{equation}

That is, information entropy is the theoretically shortest average code length. This is a very strong result: it tells us that no matter what coding scheme you use, the average code length can never be lower than \eqref{eq:l}.
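As a quick numerical illustration (the symbol probabilities below are made up), the following Python sketch builds a Huffman code with the standard heap-based construction and compares its average length with the entropy bound.

```python
import heapq
import math

# Made-up symbol probabilities for illustration.
probs = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}

# Standard Huffman construction: repeatedly merge the two least probable nodes.
# Heap items are (probability, tie_breaker, {symbol: code_so_far}).
heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    p1, _, c1 = heapq.heappop(heap)
    p2, _, c2 = heapq.heappop(heap)
    merged = {s: "0" + code for s, code in c1.items()}
    merged.update({s: "1" + code for s, code in c2.items()})
    heapq.heappush(heap, (p1 + p2, counter, merged))
    counter += 1
codes = heap[0][2]

avg_len = sum(probs[s] * len(codes[s]) for s in probs)
entropy = -sum(p * math.log2(p) for p in probs.values())
print(codes)             # {'a': '0', 'b': '10', 'd': '110', 'c': '111'}
print(avg_len, entropy)  # 1.75 vs roughly 1.74: average length >= entropy
```

On this example the Huffman average length is 1.75 bits while the entropy bound is about 1.74 bits, so the code is essentially optimal.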

In the delimiter-free (prefix-free) coding scenario described above, we can give a simple proof that \eqref{eq:l} is a lower bound. Let the code lengths be $l_1, l_2, \dots, l_n$; the average length is:

\begin{equation} \sum_{i=1}^n p_i l_i = -\sum_{i=1}^n p_i \log_2 2^{-l_i} \label{eq:avg-l} \end{equation}

Because we have inequality \eqref{eq:leq}, define:

\begin{equation} \hat{p} = 1 - \sum_{i=1}^n 2^{-l_i}\geq 0 \end{equation}

Then Equation \eqref{eq:avg-l} can be written as:

\begin{equation} - 0\times \log_2 \hat{p}-\sum_{i=1}^n p_i \log_2 2^{-l_i} \end{equation}

This is the cross-entropy of the distributions $P=(0, p_1, p_2, \dots, p_n)$ and $Q=(\hat{p}, 2^{-l_1}, 2^{-l_2}, \dots, 2^{-l_n})$. The cross-entropy is minimized over $Q$ exactly when $Q=P$ (which follows from $KL(P\|Q) \geq 0$), and its minimum value is the entropy of $P$ itself, to which the zero entry contributes nothing. Therefore:

\begin{equation} \sum_{i=1}^n p_i l_i \geq -\sum_{i=1}^n p_i \log_2 p_i \end{equation}

Anyone with even a passing acquaintance with information theory will know this conclusion. The pursuit of minimum entropy is then about finding better schemes to approach this limit, which is also the theoretical limit of lossless compression.

InfoMap

Back to InfoMap: it is directly related to compression coding. Only recently (October 15, 2019) did I finally work out its ins and outs. The difficulty was partly my own unfamiliarity with information theory, and partly that the original papers perhaps do not spell out the sticking points for non-specialist readers.

InfoMap Publication List: https://www.mapequation.org/publications.html

Categorized Memory

If we are asked to memorize a sequence in a short time:

Pear, Grape, Banana, Guangzhou, Shanghai, Beijing, Hangzhou, Shenzhen, 123, 654, 798, 963

The sequence isn't long, but to form memories quickly, you would likely group them:

1. The first 3 are Fruits; the middle 5 are Cities; the last 4 are Arabic Numbers;
2. The Fruits are Pear, Grape, Banana;
3. The Cities are Guangzhou, Shanghai, Beijing, Hangzhou, Shenzhen;
4. The Numbers are 123, 654, 798, 963.

Memorizing by category in this way is more efficient. The mathematical description of "improving efficiency" or "saving effort" is a reduction in entropy. Thus, a good classification scheme is one that satisfies the minimum entropy principle. This is the core optimization goal of InfoMap: finding the optimal clustering scheme by minimizing entropy.

Hierarchical Coding: Concept

As established above, information entropy equals the shortest average code length. Since categorized memory is more efficient, it must correspond to some form of compression coding: hierarchical coding.

In hierarchical coding, we no longer use a single code for an object but a combination of two: the first code represents the object's category, and the second represents its index within the category. If objects of the same category often appear "clumped," hierarchical coding can compress the data.

How does hierarchical coding compress? Most InfoMap resources do not highlight the actual coding method. I believe the process is as follows:

Hierarchical Coding method
Hierarchical coding method: Insert a category marker before words of the same category and an exit marker at the end of the category block.

To remain lossless, we insert category tags and exit tags. Category tags use one set of codes; in-category objects and exit tags use another set. Since categories provide separation, objects across different categories can share the same code (e.g., Pear and Shanghai could both be 001). This reduces average code length. When decoding, you read the category code, switch to that category's set, read until an exit marker appears, and repeat.
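A minimal Python sketch of this scheme, with made-up codebooks purely for illustration: each clump of same-category objects is wrapped between a category marker and that category's exit tag, and objects in different categories may reuse the same codes.

```python
# Toy two-level (hierarchical) coding, with made-up codebooks for illustration.
# Category-level codebook: which category we are entering.
category_codes = {"fruit": "0", "city": "10", "number": "11"}
# Within-category codebooks: objects plus a special exit tag "EXIT".
# Note that objects in different categories may share the same code.
object_codes = {
    "fruit":  {"pear": "00", "grape": "01", "banana": "10", "EXIT": "11"},
    "city":   {"guangzhou": "00", "shanghai": "01", "beijing": "10", "EXIT": "11"},
    "number": {"123": "00", "654": "01", "EXIT": "1"},
}

def encode(sequence):
    """Encode a list of (category, object) pairs that arrive in clumps."""
    bits, current = [], None
    for cat, obj in sequence:
        if cat != current:                      # entering a new clump
            if current is not None:
                bits.append(object_codes[current]["EXIT"])
            bits.append(category_codes[cat])    # category marker
            current = cat
        bits.append(object_codes[cat][obj])     # within-category code
    if current is not None:
        bits.append(object_codes[current]["EXIT"])
    return "".join(bits)

seq = [("fruit", "pear"), ("fruit", "grape"),
       ("city", "shanghai"), ("city", "beijing"),
       ("number", "123"), ("number", "654")]
print(encode(seq))  # shorter than a flat code when the sequence is clumped
```

Because each within-category codebook only needs to distinguish a few objects plus an exit tag, its codes are shorter than those of a single global codebook, which is where the compression comes from.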

Hierarchical Coding: Calculation

Now we need a quantitative calculation. Assume we have a hierarchical coding scheme; we calculate the average length using these definitions:

| Notation | Meaning |
|---|---|
| $p_{\alpha}$ | Probability of object $\alpha$ appearing |
| $q_{i\curvearrowright}$ | Probability of category $i$ appearing |

According to our convention, every entry into a category block ends with an exit tag. Thus $q_{i\curvearrowright}$ is also the occurrence probability of category $i$'s exit tag. Note that all of these are globally normalized probabilities:

\begin{equation} \underbrace{\sum_i q_{i\curvearrowright}}_{\text{category tag probabilities}} + \underbrace{\sum_{\alpha}p_{\alpha}}_{\text{object probabilities}} + \underbrace{\sum_i q_{i\curvearrowright}}_{\text{exit tag probabilities}} = 1 \label{eq:guiyi} \end{equation}

Because categories and within-category objects use different coding sets, we calculate them separately. The shortest average code length for categories is:

\begin{equation} H(\mathcal{Q})=-\sum_i \frac{q_{i\curvearrowright}}{q_{\curvearrowright}} \log \frac{q_{i\curvearrowright}}{q_{\curvearrowright}} \end{equation}

where $q_{\curvearrowright} = \sum_i q_{i\curvearrowright}$. This results from normalizing category probabilities and applying the entropy formula.

Similarly, for category $i$, the within-category shortest average length (including exit tags) is:

\begin{equation} H(\mathcal{P}^i)=- \frac{q_{i\curvearrowright}}{p_{i\circlearrowright}} \log \frac{q_{i\curvearrowright}}{p_{i\circlearrowright}} -\sum_{\alpha\in i} \frac{p_{\alpha}}{p_{i\circlearrowright}} \log \frac{p_{\alpha}}{p_{i\circlearrowright}} \end{equation}

where $p_{i\circlearrowright} = q_{i\curvearrowright} + \sum_{\alpha\in i} p_{\alpha}$ (note the distinction between $p$ and $q$, and between $\curvearrowright$ and $\circlearrowright$).

The total weighted average code length is:

\begin{equation} L(M) = \color{red}{q_{\curvearrowright}H(\mathcal{Q})} + \color{skyblue}{\sum_i p_{i\circlearrowright} H(\mathcal{P}^i)} \label{eq:loss} \end{equation}

With the index \eqref{eq:loss} in hand, we can optimize: given an object sequence, we look for the clustering scheme that minimizes $L(M)$. The algorithm has almost no hyperparameters; in particular, there is no need to pre-specify the number of clusters.
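To make \eqref{eq:loss} concrete, here is a minimal Python sketch of the weighted average code length, taking the globally normalized $p_\alpha$ and $q_{i\curvearrowright}$ as inputs (the function and variable names are mine, not from the official implementation).

```python
import math

def map_equation(p, q_exit, clusters):
    """Average code length L(M) of a hierarchical coding scheme.

    p        : dict, object -> global probability p_alpha
    q_exit   : dict, cluster -> exit probability q_i
    clusters : dict, cluster -> list of objects in that cluster
    """
    q_total = sum(q_exit.values())

    # Category-level codebook: entropy of the normalized category probabilities.
    H_Q = -sum(q / q_total * math.log2(q / q_total)
               for q in q_exit.values() if q > 0)

    L = q_total * H_Q
    for i, members in clusters.items():
        # Within-category codebook: objects of cluster i plus its exit tag.
        p_circle = q_exit[i] + sum(p[a] for a in members)
        probs = [q_exit[i]] + [p[a] for a in members]
        H_Pi = -sum(x / p_circle * math.log2(x / p_circle)
                    for x in probs if x > 0)
        L += p_circle * H_Pi
    return L
```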

Random Walk

We now have the objective \eqref{eq:loss}, but a classic clustering problem does not hand us a sequence; it usually only provides pairwise similarities between samples. InfoMap therefore abstracts the samples as the nodes of a directed graph, whose edge weights are transition probabilities $p_{\alpha\to\beta}$ (normalized similarities).

At this point the classic idea of a "random walk" emerges: start at some node $\beta$, jump to node $\alpha$ with probability $p_{\beta\to\alpha}$, and repeat, generating a long node sequence. With this sequence in hand, we can apply \eqref{eq:loss}.

InfoMap coding and clustering process
InfoMap coding and clustering process. (a) Random walk; (b) Building Huffman codes from probabilities; (c) Hierarchical coding; (d) Category coding inside hierarchical coding. Below, sequences show hierarchical coding is shorter.

So this is the InfoMap process: construct transition probabilities, perform a random walk on the graph to generate a sequence, apply hierarchical coding to that sequence, and minimize \eqref{eq:loss} to obtain the clustering.
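A minimal sketch of this abstraction (the similarity matrix and all names below are made up for illustration): normalize each node's outgoing similarities into transition probabilities and simulate a short walk. As the next section shows, the walk itself never needs to be simulated in practice; only its steady-state statistics matter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up similarity matrix between 6 samples (illustrative only).
S = rng.random((6, 6))
np.fill_diagonal(S, 0.0)

# Transition probabilities: normalize each node's outgoing similarities.
# P[j, i] is the probability of jumping from node j to node i.
P = S / S.sum(axis=1, keepdims=True)

# Simulate a random walk to produce a node sequence.
walk, node = [], 0
for _ in range(20):
    node = rng.choice(len(P), p=P[node])
    walk.append(node)
print(walk)  # a node sequence to which the hierarchical coding length applies
```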

InfoMap Mathematical Details

In fact, we do not need to simulate the walk at all; we only need each node's steady-state visit probability $p_{\alpha}$, obtained by solving:

\begin{equation} \begin{pmatrix}p_1\\p_2\\ \vdots \\p_n\end{pmatrix} = \begin{pmatrix}p_{1\to 1} & p_{2\to 1} & \cdots & p_{n\to 1}\\p_{1\to 2} & p_{2\to 2} & \cdots & p_{n\to 2} \\ \vdots& \vdots& \ddots & \vdots \\p_{1\to n} & p_{2\to n} & \cdots & p_{n\to n}\end{pmatrix} \begin{pmatrix}p_1\\p_2\\ \vdots \\p_n\end{pmatrix} \label{eq:pab} \end{equation}

or $p_{\beta}=\sum_{\alpha} p_{\alpha}p_{\alpha\to\beta}$. To ensure a unique solution and avoid getting stuck in isolated areas, we introduce "teleportation probability" $\tau$ (with probability $1-\tau$ follow the graph, with probability $\tau$ jump to any node). The equation becomes:

\begin{equation} p_{\beta}=(1-\tau)\sum_{\alpha} p_{\alpha}p_{\alpha\to\beta} + \tau \sum_{\alpha} \frac{p_{\alpha}}{n} \label{eq:pa} \end{equation}

The authors set $\tau=0.15$ by default. As for $q_{i\curvearrowright}$, the exit probability of category $i$: it is the probability of stepping from a node inside category $i$ to any node outside it:

\begin{equation} q_{i\curvearrowright} = \sum_{\alpha\in i}\sum_{\beta\not\in i} p_{\alpha} p_{\alpha\to\beta} \end{equation}

With teleportation:

\begin{equation} \begin{aligned} q_{i\curvearrowright} =& \sum_{\alpha\in i}\sum_{\beta\not\in i} p_{\alpha} \left[(1-\tau)p_{\alpha\to\beta} + \frac{\tau}{n}\right]\\ =& \tau\frac{n - n_i}{n}\sum_{\alpha\in i}p_{\alpha} + (1-\tau)\sum_{\alpha\in i}\sum_{\beta\not\in i} p_{\alpha}p_{\alpha\to\beta} \end{aligned} \label{eq:q-exit} \end{equation}

where $n_i$ is the number of nodes in category $i$.
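The two quantities above can be computed directly, for example by power iteration, as in the following NumPy sketch (the function names are mine; `P[a, b]` stands for $p_{\alpha\to\beta}$, with rows summing to one).

```python
import numpy as np

def stationary_distribution(P, tau=0.15, n_iter=200):
    """Solve p_beta = (1 - tau) * sum_a p_a P[a, b] + tau / n by power iteration."""
    n = P.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        p = (1 - tau) * p @ P + tau / n   # sum_a p_a = 1, so the teleport term is tau / n
    return p

def exit_probabilities(P, p, clusters, tau=0.15):
    """q_i: probability of stepping from inside cluster i to outside it."""
    n = P.shape[0]
    q = {}
    for i, members in clusters.items():
        members = np.array(members)
        outside = np.setdiff1d(np.arange(n), members)
        p_in = p[members].sum()
        flow_out = (p[members, None] * P[np.ix_(members, outside)]).sum()
        q[i] = tau * len(outside) / n * p_in + (1 - tau) * flow_out
    return q
```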

The InfoMap workflow:

  1. Define transition probabilities $p_{\alpha\to\beta}$;
  2. Numerically solve \eqref{eq:pa} for $p_{\alpha}$;
  3. Search for a clustering scheme to minimize \eqref{eq:loss}.

Extension Ideas

InfoMap extends naturally in two directions: multilevel hierarchical coding (clusters of clusters), obtained by iterating the loss function, and overlapping community detection (a node may belong to several clusters). Allowing a node to belong to multiple categories adds redundancy to that node's code, but it may reduce the cost of exit and category markers and thus lower the total average length. Refer to "Multilevel compression..." and "Compression of Flow Can Reveal Overlapping-Module...".

Experiments

Solving Algorithm

The original papers used greedy search plus simulated annealing: each node starts as its own cluster, and pairs of clusters are repeatedly merged so as to maximize the drop in \eqref{eq:loss}. A later, improved greedy algorithm makes the search fast enough for networks with millions of nodes and edges.
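A very naive sketch of this greedy merging idea, reusing the hypothetical `stationary_distribution`, `exit_probabilities`, and `map_equation` helpers sketched earlier (the official implementation is far more sophisticated and efficient):

```python
def greedy_infomap(P, tau=0.15):
    """Greedy agglomerative search: start with one cluster per node and keep
    merging the pair of clusters that decreases L(M) the most."""
    n = P.shape[0]
    p = stationary_distribution(P, tau)
    p_dict = {a: p[a] for a in range(n)}
    clusters = {i: [i] for i in range(n)}

    def loss(cl):
        q = exit_probabilities(P, p, cl, tau)
        return map_equation(p_dict, q, cl)

    best = loss(clusters)
    improved = True
    while improved and len(clusters) > 1:
        improved = False
        keys = list(clusters)
        for idx, i in enumerate(keys):
            for j in keys[idx + 1:]:
                trial = {k: v for k, v in clusters.items() if k not in (i, j)}
                trial[i] = clusters[i] + clusters[j]
                L = loss(trial)
                if L < best:                 # remember the best merge seen so far
                    best, best_pair = L, (i, j)
                    improved = True
        if improved:                         # apply the best merge of this pass
            i, j = best_pair
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
    return clusters, best
```

Each pass evaluates every pair of clusters, so this sketch scales poorly; it is meant only to make the objective and the merge step explicit.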

Installation

The official implementation is written in C++, with Python and R wrappers available. Official site: https://www.mapequation.org/code.html.
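If memory serves, the Python wrapper can be installed with `pip install infomap`; a minimal usage sketch might then look like the following, though the exact API (the class and method names below are from memory) should be checked against the official documentation.

```python
# Hedged sketch of the official Python wrapper; names are from memory and
# should be verified against https://www.mapequation.org/code.html.
from infomap import Infomap

im = Infomap()           # options such as "--directed" can reportedly be passed as a string
im.add_link(0, 1, 1.0)   # (source, target, weight)
im.add_link(1, 2, 1.0)
im.add_link(2, 0, 1.0)
im.add_link(3, 4, 1.0)
im.add_link(4, 3, 1.0)
im.run()

print(im.get_modules())  # expected: a {node_id: module_id} mapping
```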

Word Clustering

An example that clusters words using Word2Vec similarities is given in word_cluster.py. The resulting clusters are fairly coherent, for example family members (Sister, Father, Aunt), corporate finance terms (Stock, Securities, Loan, CEO), and school/education terms.
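The linked script is the reference; as a rough outline of the idea (the model path and the top-k choice below are made up, and the attribute names follow recent gensim versions), one can turn nearest-neighbor similarities into weighted edges and feed them to InfoMap:

```python
# Rough outline only; see the linked word_cluster.py for the actual script.
from gensim.models import Word2Vec

model = Word2Vec.load("word2vec.model")          # hypothetical model path
edges = []
for word in model.wv.index_to_key[:10000]:       # restrict to frequent words
    for neighbor, sim in model.wv.most_similar(word, topn=10):
        edges.append((word, neighbor, sim))      # similarity as edge weight
# The (word, neighbor, similarity) triples define a weighted directed graph,
# which can then be clustered with InfoMap (e.g. via add_link calls as above).
```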

Summary

InfoMap is a clustering/community detection algorithm based on transition probabilities with a beautiful information-theoretic (Minimum Entropy) explanation. It has few hyperparameters and is widely applicable for finding structural modules in any domain containing nodes and graph connections. Its elegance makes it worth exploring.