SVD Decomposition (Part 1): Autoencoders and Artificial Intelligence

By 苏剑林 | January 15, 2017

At first glance, SVD decomposition appears to be a traditional data mining technique, while autoencoders are a relatively "advanced" concept in deep learning; one might assume they have little in common. However, this article aims to show that if we ignore activation functions, the two are equivalent. Further reflection reveals that whether it is SVD or an autoencoder, we perform dimensionality reduction not purely to reduce storage or computation, but as a preliminary manifestation of "intelligence."

Equivalence

Suppose we have a massive matrix $M_{m\times n}$ with $m$ rows and $n$ columns. Performing calculations or even storing this matrix can be problematic. Therefore, we consider a decomposition, hoping to find matrices $A_{m\times k}$ and $B_{k\times n}$ such that: $$M_{m\times n}=A_{m\times k}\times B_{k\times n}$$ where the multiplication is matrix multiplication. As shown below:

[Figure: SVD decomposition]

In this way, what originally had $mn$ elements is now represented by $(m+n)k$ elements on the right side. If $k$ is small enough that $(m+n)k < mn$, this reduces both storage and computational requirements. Of course, for small $k$ the equality usually cannot hold exactly; instead, $A$ and $B$ are chosen to minimize the reconstruction error, and we will not make a strict distinction between the two cases here. This is the core of the SVD decomposition described in this article. If the matrix $M_{m\times n}$ is sparse to begin with (such as a document matrix built from one-hot encoding, or a user rating matrix), $k$ can typically be several orders of magnitude smaller while keeping accuracy essentially unchanged. SVD is used, among other things, for topic modeling and recommendation systems. (Note: SVD may be defined differently in some contexts, but this is at least its fundamental purpose.)
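To make this concrete, here is a minimal numpy sketch (the matrix sizes and the function name `low_rank_factor` are chosen purely for illustration) that builds such a factorization with a truncated SVD and checks both the storage saving and the reconstruction error:

```python
import numpy as np

def low_rank_factor(M, k):
    """Factor M (m x n) into A (m x k) and B (k x n) via truncated SVD.

    By the Eckart-Young theorem, A @ B is the best rank-k approximation
    of M in the Frobenius norm.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = U[:, :k] * s[:k]      # absorb the singular values into A
    B = Vt[:k, :]
    return A, B

# toy example: a 1000 x 200 matrix that is approximately rank 10
rng = np.random.default_rng(0)
m, n, k = 1000, 200, 10
M = rng.normal(size=(m, k)) @ rng.normal(size=(k, n)) + 0.01 * rng.normal(size=(m, n))

A, B = low_rank_factor(M, k)
print("original elements:", m * n)          # m * n
print("factored elements:", (m + n) * k)    # (m + n) * k, much smaller
print("relative error:", np.linalg.norm(M - A @ B) / np.linalg.norm(M))
```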

So, what about autoencoders? Here, let's ignore activation functions for a moment and look only at the linear structure. An autoencoder aims to train an identity function $f(x)=x$, but with a compression at the hidden layer.

[Figure: Autoencoder network structure]

Mathematically, we want to find matrices $C_{n\times k}$ and $D_{k\times n}$ such that: $$M_{m\times n} \approx M_{m\times n} \times C_{n\times k}\times D_{k\times n}$$ The equality here is not strictly equal but is optimal in some sense. The diagram is as follows:

[Figure: Autoencoder]

Consequently, if we let: $$A_{m\times k} = M_{m\times n} \times C_{n\times k},\quad B_{k\times n} = D_{k\times n}$$ aren't the two equivalent? As long as they are optimized under the same error function (loss function), the resulting outcomes must be equivalent.
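We can also check this equivalence numerically. The sketch below (numpy, random data) does not actually train a network; it uses the known closed-form optimum of the linear autoencoder under squared loss, namely $C = V_k$ and $D = V_k^{\top}$ built from the top-$k$ right singular vectors, and confirms that $M\times C\times D$ coincides with the rank-$k$ SVD reconstruction $A\times B$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 300, 50, 5
M = rng.normal(size=(m, n))

# rank-k SVD reconstruction: A @ B
U, s, Vt = np.linalg.svd(M, full_matrices=False)
A = U[:, :k] * s[:k]
B = Vt[:k, :]

# linear autoencoder: C encodes n -> k, D decodes k -> n.
# Under squared loss the optimal encoder/decoder span the top-k right
# singular subspace, so we take C = V_k and D = V_k^T directly.
C = Vt[:k, :].T   # n x k
D = Vt[:k, :]     # k x n

print(np.allclose(A @ B, M @ C @ D))   # True: the two reconstructions agree
```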

Thus, we have proven that a three-layer autoencoder without activation functions is actually equivalent to traditional SVD decomposition. Despite this equivalence, the autoencoder remains an innovation because it transforms matrix decomposition into a neural network compression and encoding problem, making it clearer and easier to understand. Furthermore, it allows for mini-batch training, unlike SVD which traditionally requires all data to be input at once (though one could conceive of a "batch SVD" algorithm by imitating neural networks, it might not be particularly meaningful).

Compression and Intelligence

While the autoencoder view makes compression and encoding more intuitive, SVD makes the physical meaning more explicit, which shows how worthwhile it is to look at the same thing from different perspectives. The physical meaning of SVD can be seen by considering document encoding, as in the figure below, where each column represents a word and each row represents an article; for example, five words A, B, C, D, E and six articles.

[Figure: Physical meaning of SVD]

As shown, we can view the physical meaning of SVD this way: for the first decomposed matrix, we can see it as a clustering of rows—that is, a simple clustering of documents. For the second decomposed matrix, we can see it as a clustering of columns—that is, a clustering of words. Matrix multiplication then signifies the matching between document clusters and word clusters. Readers might ask: why must words and documents be clustered into the same number of categories? Why not different numbers? In fact, a more reasonable formulation is: $$M_{m\times n}=A_{m\times k}\times P_{k\times l} \times Q_{l\times n}$$ This is equivalent to saying documents are clustered into $k$ categories, words are clustered into $l$ categories, and $P_{k\times l}$ is the matching matrix between document clusters and word clusters. Then, simply setting $B_{k\times n} = P_{k\times l} \times Q_{l\times n}$ returns us to the original form.
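Here is a toy numerical illustration of this reading (all numbers made up): $A$ assigns documents to clusters, $Q$ assigns words to clusters, $P$ matches the two, and folding $P$ into $Q$ recovers the original two-factor form:

```python
import numpy as np

# toy co-clustering view: 4 documents in k=2 clusters,
# 6 words in l=3 clusters, with a k x l matching matrix P
A = np.array([[1, 0],   # documents 0 and 1 belong to document-cluster 0
              [1, 0],
              [0, 1],   # documents 2 and 3 belong to document-cluster 1
              [0, 1]], dtype=float)
Q = np.array([[1, 1, 0, 0, 0, 0],    # word-cluster 0 = words 0, 1
              [0, 0, 1, 1, 0, 0],    # word-cluster 1 = words 2, 3
              [0, 0, 0, 0, 1, 1]],   # word-cluster 2 = words 4, 5
             dtype=float)
P = np.array([[2.0, 0.0, 1.0],       # how strongly each document cluster
              [0.0, 3.0, 1.0]])      # uses each word cluster

M = A @ P @ Q    # document-word matrix implied by the clustering
B = P @ Q        # folding the matching matrix gives the two-factor form
print(np.allclose(M, A @ B))   # True
```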

Why construct such a seemingly forced explanation? My purpose is to explain why "intelligence" is manifested in artificial models. We know that humans have memory, but after reaching a certain level of memory, we are often not satisfied with mere rote memorization. Instead, we find the patterns in what needs to be remembered and memorize through patterns. For instance, in language, we categorize words into verbs, nouns, etc., and then we discover patterns like "verb + noun" usually forming a phrase. We may not always be able to articulate these patterns, but it is undeniable that our brains are undergoing such processes of "pattern finding" and "categorization." From a data mining perspective, I believe this is a process of compression and clustering. That is, by reconstructing data after compression, we can mine the commonalities of the data—what we call patterns—and obtain more generalized results.

This may answer certain questions, such as: why do deep learning-based word segmentation models (whether LSTM or CNN) have such good recognition capabilities for out-of-vocabulary (OOV) words? It is because they contain the structure of an autoencoder, which is equivalent to SVD. As we explained earlier, SVD carries the physical meaning of clustering (pattern finding), making it more generalized and higher performing.

Deriving New Words

Too abstract? Can't follow? Don't worry, here is an example. This example involves our understanding of words. Please look at the following table: $$\begin{array}{c|c|c|c|c|c} \hline & \text{Beast (兽)} & \text{World (界)} & \text{Dragon (龙)} & \text{Spirit (灵)} & \text{Art/Technique (术)} \\ \hline \text{God (神)} & 1 & 1 & 1 & 1 & 0 \\ \hline \text{Magic/Demon (魔)} & 1 & 1 & 1 & 1 & 1 \\ \hline \end{array}$$

Assume these are statistics from a science fiction novel. They represent that terms like "God-Beast," "God-World," "God-Dragon," "God-Spirit," "Magic-Beast," "Magic-World," "Magic-Dragon," "Magic-Spirit," and "Magic-Art" appeared in the novel with a frequency of 1, but the term "God-Art" (神术) never appeared. Would we then infer: "Magic" (魔) and "God" (神) seem quite similar in usage, so could "God-Art" be a valid term? How do we express this mathematically?
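Before giving the general answer, here is a small numerical check on the table above (a minimal numpy sketch): the best rank-1 approximation of this $2\times 5$ matrix already assigns "God-Art" a clearly positive score, even though its observed count is zero.

```python
import numpy as np

# rows: 神 (God), 魔 (Magic); columns: 兽, 界, 龙, 灵, 术
M = np.array([[1, 1, 1, 1, 0],
              [1, 1, 1, 1, 1]], dtype=float)

# best rank-1 reconstruction
U, s, Vt = np.linalg.svd(M, full_matrices=False)
R = s[0] * np.outer(U[:, 0], Vt[0, :])

print(np.round(R, 2))
print("predicted score for 神术:", round(R[0, -1], 2))  # clearly positive, about 0.5
```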

Surprisingly, one way to do this is SVD! In other words, we should perform some clustering on the usage of the first characters (perhaps as adjectives, verbs, or nouns?) and clustering on the usage of the second characters, and then consider the pairings between clusters. Of course, we don't actually need to tell the computer to cluster them as nouns or verbs; computers do not need to understand language in the way humans do. We only need to tell the computer to perform clustering, and since SVD inherently implies clustering, we just need to perform SVD. The specific experiment is as follows:

Select all two-character words with a frequency of at least 100 from the Jieba segmentation dictionary. Then, use the first character as the row, the second character as the column, and the frequency as the value (if the word does not exist, mark it as 0) to obtain a matrix. Perform SVD decomposition on this matrix to get two matrices, then multiply them back together (reconstruction). Treat the reconstructed matrix values as new frequencies and compare them to the original matrix to see which high-frequency words have emerged that were not there before.
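Here is a rough sketch of this procedure in Python. The dictionary path, the frequency cutoff of 100, the number of singular values kept, and the reporting threshold are placeholders; jieba's bundled dictionary stores one word per line as "word frequency pos-tag":

```python
import numpy as np

DICT_PATH = "dict.txt"   # placeholder: a jieba-style dictionary file
MIN_FREQ = 100
K = 100                  # number of singular values to keep (placeholder)

# 1. collect two-character words with frequency >= MIN_FREQ
freqs = {}
with open(DICT_PATH, encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        if len(parts) >= 2 and len(parts[0]) == 2:
            word, freq = parts[0], int(parts[1])
            if freq >= MIN_FREQ:
                freqs[word] = freq

rows = sorted({w[0] for w in freqs})   # first characters
cols = sorted({w[1] for w in freqs})   # second characters
ri = {c: i for i, c in enumerate(rows)}
ci = {c: i for i, c in enumerate(cols)}

M = np.zeros((len(rows), len(cols)))
for w, f in freqs.items():
    M[ri[w[0]], ci[w[1]]] = f

# 2. truncated SVD and reconstruction
# (for a very large matrix, scipy.sparse.linalg.svds would be faster)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
R = (U[:, :K] * s[:K]) @ Vt[:K, :]

# 3. report "new words": zero in the original matrix but high after reconstruction
for i, a in enumerate(rows):
    for j, b in enumerate(cols):
        if M[i, j] == 0 and R[i, j] > 10:   # threshold 10 is arbitrary
            print(a + b, round(R[i, j], 6))
```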

As a result, I obtained some words that were not in the input dictionary, such as (the following is just a small sample):

龙脑 (Dragon-Brain) 10.271244
龙脚 (Dragon-Foot) 12.496673
龙腊 (Dragon-Wax) 16.860170
龙腿 (Dragon-Leg) 12.172765
龙莲 (Dragon-Lotus) 11.362767
龙蔗 (Dragon-Cane) 67.800909
龙薯 (Dragon-Potato) 30.580730
七讲 (Seven-Lectures) 11.439969
七评 (Seven-Reviews) 12.362163
七课 (Seven-Lessons) 11.587767
七郎 (Seven-Lad) 12.789438
七隐 (Seven-Hidden) 10.479609
七页 (Seven-Pages) 15.499356
七项 (Seven-Items) 18.802959
怨怒 (Resentment-Anger) 29.831461
怨恶 (Resentment-Evil) 15.875075
怨苦 (Resentment-Bitterness) 24.979326
怪兔 (Strange-Rabbit) 10.246062
怪奇 (Strange-Odd) 12.768122
怪孩 (Strange-Child) 14.391967
怪形 (Strange-Shape) 14.856068

Among these words, many are unreliable; in fact, there are far more unreliable ones than reliable ones. But that doesn't matter, because when we process real natural-language text, the unreliable words simply will not be encountered. This leads to a fascinating result: we perform SVD decomposition originally for purposes like compression and dimensionality reduction, yet after reconstruction it derives richer results with a larger vocabulary, which appears to be the outcome of clustering and pattern finding. This is different from statistics-based new word discovery: these new words do not exist in the corpus but are derived from the construction patterns of existing words. This is a genuine process of inference, that is, "intelligence." To exaggerate a little, SVD brought about a preliminary form of artificial intelligence (though it is certainly not the only method).

Activation Function

What on earth is an activation function? In the experiments above, I also discovered a concrete physical meaning for the activation function. When we decompose a matrix with SVD and then reconstruct it, some elements in the reconstructed matrix will be negative. However, the meaning of our matrix is frequency, and frequency cannot be negative. Therefore, we discard (truncate) the negative numbers and set them to 0.

Wait a moment: isn't this exactly the operation $\max(x, 0)$? Isn't this ReLU, the most widely used activation function? It turns out that the physical meaning of the activation function here is nothing more than discarding (truncating) irrelevant elements, an operation we have long been using in statistics (discarding negative numbers, discarding statistics that are too small, and so on).
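In code, this truncation is a single line; here is a minimal sketch with made-up numbers:

```python
import numpy as np

# a reconstruction will generally contain some negative entries,
# but frequencies cannot be negative
R = np.array([[ 0.9, -0.2],
              [-0.1,  1.1]])

# discarding the negatives is exactly the ReLU operation max(x, 0)
print(np.maximum(R, 0))
```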

Of course, in neural networks, activation functions have more profound meanings, but in shallow networks (matrix decomposition), the intuitive impression they give us is simply truncation.

Summary

This article centered on SVD to explore its inherent physical meaning (clustering), the connection between SVD decomposition and autoencoders, the intricate relationship between SVD and intelligence, and a related experiment. Most of this content is conceptual and geared towards understanding—deepening our grasp of models and artificial intelligence so that we have more "confidence" when applying models, rather than just blindly plugging into them.