SVD Decomposition (II): Why Does SVD Imply Clustering?

By 苏剑林 | Jan 26, 2017

Wishing all readers a Happy New Year in advance, and good luck for 2017!

This article mainly answers two "why" questions: 1. Why did I become interested in SVD? 2. Why do I say that SVD is a clustering process? The answers are purely the result of my own reflection, and for now I have no formal references to cite.

Why Research SVD?

Since I first encountered deep learning in 2015, I have been researching it for nearly two years. Nowadays, concepts like deep learning and data science are blooming everywhere. Why, in an era where deep learning is so popular, have I gone back to study the "ancient" SVD decomposition? I believe that as a matrix decomposition algorithm, the value of SVD is not only reflected in its wide range of applications but also in its deeper internal meaning—specifically, its interpretability. Even today, many people still feel that deep learning (neural networks) is just an effective "black box" model. However, simply using the words "black box" to explain the effectiveness of deep learning is clearly unsatisfying. As mentioned previously, SVD decomposition is essentially equivalent to a three-layer autoencoder without activation functions. Understanding SVD decomposition can help seek a reasonable probabilistic explanation for neural network models.

Recently, while trying to build more complex models such as question-answering systems and chatbots, I have increasingly felt that the headline-grabbing deep learning models like seq2seq are virtually unusable as a starting point. Instead, I essentially began from the most basic probability model $P(A|Q)$, simplified it step by step, and finally obtained a model of acceptable complexity. Models obtained this way have clear meaning and strong controllability. However, part of this is built on statistics, and pure statistics cannot yield truly "intelligent" results; yet, as mentioned before, SVD decomposition can add a preliminary level of intelligence on top of statistical results. This gives me a strong sense of two things: first, the interpretability of a model, especially its probabilistic interpretation, is very important; second, after understanding SVD better, I have a much better feel for the meaning and application of neural network models.

How Does SVD Decomposition Cluster?

Why does SVD decomposition amount to clustering? In fact, it comes down to a very simple probabilistic model.

Given a matrix $M_{m \times n}$, without loss of generality, assume all its elements are non-negative. We can then normalize each row so that it sums to 1. The resulting matrix can be read as a matrix of transition probabilities:

$$P(B|A)=\begin{pmatrix}p(b_1|a_1) & p(b_2|a_1) & \dots & p(b_n|a_1)\\ p(b_1|a_2) & p(b_2|a_2) & \dots & p(b_n|a_2)\\ \vdots & \vdots & \ddots & \vdots\\ p(b_1|a_m) & p(b_2|a_m) & \dots & p(b_n|a_m)\end{pmatrix}$$

The normalization condition is:

$$\sum_{j=1}^n p(b_j|a_i)=1, \quad i=1,2,\dots,m$$

The term $p(b_j|a_i)$ represents the probability of $b_j$ following $a_i$. This kind of probability model is very common, such as in bigram language models.
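To make this concrete, here is a minimal numpy sketch (the toy corpus and variable names are my own, purely for illustration) that builds a bigram count matrix and row-normalizes it into a matrix of the form $P(B|A)$:

```python
import numpy as np

# Toy corpus, purely illustrative.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# M[i, j] counts how often word j immediately follows word i.
M = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for a, b in zip(sent, sent[1:]):
        M[index[a], index[b]] += 1

# Row-normalize: each non-empty row of P sums to 1 and reads as p(b | a).
# (Words that never start a bigram keep an all-zero row; a real model
#  would smooth these.)
row_sums = M.sum(axis=1, keepdims=True)
P = np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)

print(vocab)
print(P.round(2))
```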

Now, let us assume that the $a_i$ can be clustered into $l$ categories, denoted $c_1, c_2, \dots, c_l$, and that the $b_j$ can be clustered into $k$ categories, denoted $d_1, d_2, \dots, d_k$. If we want to study the patterns by which $b_j$ follows $a_i$, this can actually be reduced to the patterns between categories. (A typical small-scale case: we classify words into verbs, nouns, adjectives, and so on, and then notice that a noun can follow a verb to form a phrase; "verb + noun" is a clustering rule discovered by our brains.) This is the core assumption behind SVD decomposition. Stated more explicitly, the assumptions are:

1. Both $a_i$ and $b_i$ can be clustered into several categories;
2. The connection rules between $a_i$ and $b_i$ can be simplified to the connection rules between the categories they belong to.

By the law of total probability, together with the assumptions above, we then get:

$$p(b_j|a_i) = \sum_{k,l}p(b_j|d_k)p(d_k|c_l)p(c_l|a_i)$$

Each term has a very clear probabilistic meaning:

$p(c_l|a_i)$ is the probability that $a_i$ manifests as category $c_l$;

$p(d_k|c_l)$ is the probability that category $d_k$ follows category $c_l$;

$p(b_j|d_k)$ is the probability that the element is $b_j$, given the category is $d_k$.

Thus we naturally have $p(b_j|a_i) = \sum_{k,l}p(b_j|d_k)p(d_k|c_l)p(c_l|a_i)$; in other words, as long as the assumptions hold, this formula holds exactly. If we write each conditional distribution as a matrix whose rows are indexed by the conditioning variable, just as $P(B|A)$ was written above, then this operation is precisely the product of three matrices:

$$P(B|A)=P(C|A)\times P(D|C)\times P(B|D)$$
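As a quick numerical sanity check (a minimal sketch with arbitrary sizes of my own choosing), multiplying three random row-stochastic matrices in this order does yield an $m \times n$ matrix whose rows again sum to 1, just as the sum formula above requires:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_row_stochastic(rows, cols):
    """A random matrix whose rows are probability distributions."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

m, l, k, n = 6, 3, 4, 5                       # arbitrary sizes
P_C_given_A = random_row_stochastic(m, l)     # entries p(c_l | a_i)
P_D_given_C = random_row_stochastic(l, k)     # entries p(d_k | c_l)
P_B_given_D = random_row_stochastic(k, n)     # entries p(b_j | d_k)

# Chaining the three conditional distributions reproduces the sum formula,
# and the product is again row-stochastic.
P_B_given_A = P_C_given_A @ P_D_given_C @ P_B_given_D
print(np.allclose(P_B_given_A.sum(axis=1), 1.0))   # True
```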

That is to say, a matrix is decomposed into the product of three smaller matrices: is this not exactly SVD decomposition? Of course, there is a subtle difference: a genuinely probabilistic decomposition would require normalization constraints on the factors (that direction belongs to the pLSA model in topic modeling), whereas SVD imposes no such constraints. But this does not affect the essential point: matrix decomposition inherently carries the meaning of clustering.
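For comparison, here is what the truncated SVD version of this "three small factors" structure looks like in numpy (a random matrix is used only to show the shapes; unlike the probabilistic factors above, the SVD factors are neither non-negative nor normalized):

```python
import numpy as np

rng = np.random.default_rng(1)

m, n, r = 8, 6, 3                   # a random matrix, used only for the shapes
M = rng.random((m, n))

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep the r largest singular values: three small factors, structurally
# analogous to P(C|A) (m x r), P(D|C) (r x r) and P(B|D) (r x n).
U_r, S_r, Vt_r = U[:, :r], np.diag(s[:r]), Vt[:r, :]
M_approx = U_r @ S_r @ Vt_r

print(U_r.shape, S_r.shape, Vt_r.shape)   # (8, 3) (3, 3) (3, 6)
print(np.linalg.norm(M - M_approx))       # reconstruction error of the rank-r fit
```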

In this way, we use matrix decomposition to cluster rows and columns. We do not need to tell the computer which categories to cluster into (for example, we don't need to tell the computer to divide words into nouns, verbs, adjectives, etc.); instead, it is completed directly by matrix decomposition. (Imagine if you just shouted through a megaphone, "Assemble! Time to cluster!" and everyone automatically sorted themselves into groups without you telling them how.) Or to put it another way, through a probabilistic model, we have endowed SVD decomposition with the meaning of clustering.
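As an illustrative sketch of this point (the block-structured toy matrix and the crude sign-based grouping are my own simplifications; in practice one would run something like k-means on the row factors), the left factor of a truncated SVD can indeed be used to group rows that have similar "what follows me" distributions, without ever naming the categories:

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy transition matrix with two obvious "row types": rows 0-2 mostly
# transition to columns 0-2, rows 3-5 mostly to columns 3-5.
P = np.kron(np.eye(2), np.ones((3, 3))) + 0.05 * rng.random((6, 6))
P /= P.sum(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(P, full_matrices=False)
row_embed = U[:, :2] * s[:2]        # a 2-dimensional code for each row

# Rows with similar "what follows me" behaviour land close together, so even
# a crude rule (the sign of the 2nd coordinate) separates the two groups.
labels = (row_embed[:, 1] > 0).astype(int)
print(labels)                        # e.g. [0 0 0 1 1 1] (or the labels flipped)
```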

Happy New Year

Well... I felt like this was something that could be explained in one or two sentences, yet I managed to drag it out into so many words. I hope readers don't find me too verbose ^_^

Once again, I wish everyone a Happy New Year. Every year I say the same thing: I hope everyone continues to support me!