By 苏剑林 | March 02, 2018
In fact, shortly after the release of the paper "Dynamic Routing Between Capsules," a new Capsule paper, "Matrix Capsules with EM Routing," was already anonymously disclosed (during the ICLR 2018 blind review process). Now that the authors have been revealed, they are Geoffrey Hinton, Sara Sabour, and Nicholas Frosst. As expected, Hinton is indeed among them.
As everyone knows, results published by "founding father" figures like Hinton are generally quite "heavyweight." So, what are the characteristics of this new paper?
During my thinking process, the article "Understanding Matrix capsules with EM Routing" provided me with much inspiration, and the discussions by various experts on Zhihu also accelerated my reading. I would like to express my gratitude for those resources.
Paper Abstract
Let us first recall the diagram from the previous introduction, "Another New Year's Feast: From K-Means to Capsule":

A simple schematic of the Capsule framework
This diagram shows that Capsule actually describes a modeling framework. Many components within this framework can be customized, most notably the clustering algorithm. It can be said that "there are as many types of dynamic routing as there are clustering algorithms." So, what did Hinton change this time? Generally speaking, this new paper introduces the following new elements:
1. Capsules were previously represented by vectors; they are now represented by matrices.
2. The clustering algorithm has been replaced with GMM (Gaussian Mixture Model).
3. In the experimental section, a Capsule version of convolution is implemented.
A Wave of Questions
In fact, seeing these three new elements, readers might have many thoughts and questions, such as:
Vector vs Matrix
What is the difference between a matrix and a vector? Can't a matrix be flattened into a vector?
There is actually a difference. For example, what is the difference between a $4 \times 4$ matrix and a 16-dimensional vector? The answer is that the importance of elements at different positions in a matrix varies, whereas every element in a vector is treated with equal importance. Readers familiar with linear algebra may also feel that the diagonal elements of a matrix "seem" more significant than others. From a computational perspective, to transform a 16-dimensional vector into another 16-dimensional vector, we need a $16 \times 16$ transformation matrix; however, if we transform a $4 \times 4$ matrix into another $4 \times 4$ matrix, we only need a $4 \times 4$ transformation matrix, which reduces the number of parameters. From this perspective, the fundamental purpose of changing Capsules from vectors to matrices might be to reduce the computational cost.
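To make this parameter count concrete, here is a tiny NumPy sketch (the shapes are only illustrative and the variable names are my own, not from the paper):

```python
import numpy as np

# Capsule as a 16-dimensional vector: a general linear map needs 16x16 weights.
x_vec = np.random.randn(16)
W_vec = np.random.randn(16, 16)     # 256 parameters
y_vec = W_vec @ x_vec               # transformed 16-d capsule

# Capsule as a 4x4 pose matrix: right-multiplying by a 4x4 matrix suffices.
P = np.random.randn(4, 4)
W_mat = np.random.randn(4, 4)       # only 16 parameters
V = P @ W_mat                       # transformed 4x4 capsule ("vote")

print(W_vec.size, W_mat.size)       # 256 vs 16
```

Of course, right-multiplying a pose matrix by a $4\times 4$ matrix is a much more restricted map than a general $16\times 16$ linear transform; that restriction is precisely where the saving comes from.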
Cubic Arrays?
Could future Capsules be "cubic arrays" or even higher-order tensors?
Unlikely. The multiplication of higher-order tensors is essentially the multiplication of second-order matrices.
GMM vs K-Means
Is GMM clustering much different from the K-Means clustering you mentioned before?
This needs to be viewed from two angles. On the one hand, GMM can be seen as an upgraded version of K-Means, and it is inherently differentiable, requiring none of the previous "softening" tricks. If Euclidean distance is used in K-Means, then K-Means is a limiting version of GMM. On the other hand, K-Means allows us to flexibly use other similarity metrics, while GMM essentially limits us to (weighted) Euclidean distance, "hard-coding" the metric. This could be considered a disadvantage. Overall, they are roughly on par with each other.
Capsule Version of Convolution
What is the Capsule version of convolution? Why wasn't it in the previous paper?
The dynamic routing we speak of is essentially equivalent only to a fully connected layer in deep learning, whereas a convolutional layer in deep learning is a local fully connected layer. Thus, by implementing "local dynamic routing," one obtains the Capsule version of convolution. This should have appeared in Hinton's previous paper as it has nothing to do with the specific routing algorithm, but for some reason, Hinton only implemented it in this new paper.
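As a rough sketch of the "local" idea only (this is not the paper's implementation; `route` below is a hypothetical stand-in that merely averages, whereas in practice one would run the dynamic routing described later inside each window):

```python
import numpy as np

def capsule_conv_windows(poses, K=3, stride=1):
    """Group capsules arranged on an H x W grid into K x K local windows
    (im2col-style); routing is then applied independently per window.
    poses: array of shape (H, W, n_caps, 4, 4)."""
    H, W = poses.shape[:2]
    windows = []
    for y in range(0, H - K + 1, stride):
        for x in range(0, W - K + 1, stride):
            win = poses[y:y + K, x:x + K].reshape(-1, 4, 4)
            windows.append(win)      # the "layer-l capsules" for one output position
    return windows

route = lambda caps: caps.mean(axis=0)   # placeholder for a real routing step

poses = np.random.randn(6, 6, 8, 4, 4)   # 6x6 grid with 8 capsule types per cell
outputs = [route(win) for win in capsule_conv_windows(poses)]
print(len(outputs), outputs[0].shape)    # 16 windows, each routed to a 4x4 output
```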
Introduction to the GMM Model
Since this new paper uses GMM for clustering, we must put in some effort to learn about GMM. Understanding the GMM algorithm is very interesting, even if it weren't for Capsules, because the GMM model significantly deepens our understanding of probabilistic models and machine learning theory (especially unsupervised learning theory). As a preview, in the following content, I have significantly simplified the derivation process of GMM, which I believe will lower the difficulty of understanding for readers.
Of course, readers who only want to understand the core idea of Capsule can selectively skip the more theoretical parts.
Essence
In fact, it is best not to view GMM merely as a clustering algorithm, but as a true unsupervised learning algorithm that attempts to learn the distribution of data. Data points are individuals, while the distribution is a whole; moving from studying the data themselves to studying the data distribution is a qualitative change.
GMM stands for Gaussian Mixture Model. Specifically, for existing vectors $\boldsymbol{x}_1, \dots, \boldsymbol{x}_n$, GMM hopes to find the distribution $p(\boldsymbol{x})$ they satisfy. Of course, one cannot search aimlessly; it must be organized into a relatively simple form. GMM assumes this batch of data can be divided into several parts (categories), and each part is studied individually, i.e.,
\[p(\boldsymbol{x})=\sum_{j=1}^k p(j)p(\boldsymbol{x}|j)\tag{1}\]
where $j$ represents the category, taking values $1, 2, \dots, k$. Since $p(j)$ has nothing to do with $\boldsymbol{x}$, it can be considered a constant distribution, denoted as $p(j)=\pi_j$. Then $p(\boldsymbol{x}|j)$ is the probability distribution within this category. The characteristic of GMM is using probability distributions to describe a category. What should we use? We use the simplest normal distribution. Note that since $\boldsymbol{x}$ is a vector, we must consider a multivariate normal distribution, generally in the form:
\[N(\boldsymbol{x};\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)=\frac{1}{(2\pi)^{d/2}(\det\boldsymbol{\Sigma}_j)^{1/2}}\exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_j)^{\top}\boldsymbol{\Sigma}_j^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_j)\right)\tag{2}\]
where $d$ is the number of components in vector $\boldsymbol{x}$. Now we have the basic form of the model:
\[p(\boldsymbol{x})=\sum_{j=1}^k p(j)\,p(\boldsymbol{x}|j)=\sum_{j=1}^k\pi_j N(\boldsymbol{x};\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)\tag{3}\]
Solution
We have the model, but the unknown parameters are $\pi_j, \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j$. How do we determine them?
The ideal method is maximum likelihood estimation; however, it has no analytical solution, so it needs to be transformed into an EM process. Even then, the solution process is difficult to understand (involving derivatives of determinants). Here I provide a clear and simple derivation based on this fact: For a normal distribution, the result of maximum likelihood estimation is the same as the result of moment estimation for the first two moments. (Trust me, this should be one of the simplest GMM derivations you can find.)
Simply put, aren't $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$ just the mean (vector) and covariance (matrix) of a normal distribution? Couldn't we simply compute the corresponding mean and covariance directly from the samples? It's not that simple, because what we assume is a mixture of normal distributions. Computing them directly over all samples only yields the overall mean and covariance, not the mean and covariance of each category's normal distribution $p(\boldsymbol{x}|j)$.
However, we can transform it using Bayes' Theorem. First, we have:
\[p(j|\boldsymbol{x})=\frac{p(\boldsymbol{x}|j)p(j)}{p(\boldsymbol{x})}=\frac{\pi_j N(\boldsymbol{x};\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}{\sum_{j=1}^k\pi_j N(\boldsymbol{x};\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}\tag{4}\]
For example, for the mean vector, we have:
\[\boldsymbol{\mu}_j = \int p(\boldsymbol{x}|j)\boldsymbol{x}d\boldsymbol{x}=\int p(\boldsymbol{x}) \frac{p(j|\boldsymbol{x})}{p(j)}\boldsymbol{x}d\boldsymbol{x}=E\left[\frac{p(j|X)}{p(j)}X\right]\tag{5}\]
Here $E[\cdot]$ denotes the expectation with respect to $p(\boldsymbol{x})$, which we estimate by averaging over all samples. We thus obtain:
\[\boldsymbol{\mu}_j = \frac{1}{n}\sum_{i=1}^n \frac{p(j|\boldsymbol{x}_i)}{p(j)}\boldsymbol{x}_i = \frac{1}{\pi_j n}\sum_{i=1}^np(j|\boldsymbol{x}_i)\boldsymbol{x}_i\tag{6}\]
where the expression for $p(j|\boldsymbol{x})$ is given in $(4)$. Similarly, for the covariance matrix, we have:
\[\boldsymbol{\Sigma}_j = \frac{1}{\pi_j n}\sum_{i=1}^n p(j|\boldsymbol{x}_i)(\boldsymbol{x}_i-\boldsymbol{\mu}_j)(\boldsymbol{x}_i-\boldsymbol{\mu}_j)^{\top}\tag{7}\]
Then:
\[\pi_j = p(j) = \int p(j|\boldsymbol{x})p(\boldsymbol{x})d\boldsymbol{x}=E\left[p(j|X)\right]\tag{8}\]
So:
\[\pi_j = \frac{1}{n}\sum_{i=1}^n p(j|\boldsymbol{x}_i)\tag{9}\]
In principle, we would need to solve the huge coupled system formed by $(4), (6), (7), (9)$. Solving it directly is impractical, so we solve it iteratively instead, which yields the following algorithm:
\begin{equation}
\text{EM Algorithm 1}:\left\{\begin{aligned}
&p(j|\boldsymbol{x}_i) \leftarrow \frac{\pi_j N(\boldsymbol{x}_i;\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}{\sum_{j=1}^k\pi_j N(\boldsymbol{x}_i;\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}\\
&\boldsymbol{\mu}_j \leftarrow \frac{1}{\sum_{i=1}^n p(j|\boldsymbol{x}_i)}\sum_{i=1}^n p(j|\boldsymbol{x}_i)\boldsymbol{x}_i\\
&\boldsymbol{\Sigma}_j \leftarrow \frac{1}{\sum_{i=1}^n p(j|\boldsymbol{x}_i)}\sum_{i=1}^n p(j|\boldsymbol{x}_i)(\boldsymbol{x}_i-\boldsymbol{\mu}_j)(\boldsymbol{x}_i-\boldsymbol{\mu}_j)^{\top}\\
&\pi_j \leftarrow \frac{1}{n}\sum_{i=1}^n p(j|\boldsymbol{x}_i)
\end{aligned}\right.
\end{equation}
In the above process, to highlight the weighted-averaging character of the updates, equation $(9)$ (i.e., $\pi_j n=\sum_i p(j|\boldsymbol{x}_i)$) was substituted into equations $(6)$ and $(7)$. In this iteration, the first equation is called the E-step and the remaining three the M-step; the whole procedure is the EM algorithm. Below is an animation found online showing the GMM iteration process. It can be seen that the advantage of GMM is its ability to identify elliptical (general quadric-shaped) clusters, whereas K-Means can only identify spherical ones.

GMM iteration demonstration
Simplification
A simpler form of GMM is actually used in Capsules. In the previous discussion, we used a general normal distribution, equation $(2)$, but this requires calculating matrix inverses and determinants, which is computationally expensive. A simpler model assumes the covariance matrix is a diagonal matrix $\boldsymbol{\Sigma}_j = \text{diag}\boldsymbol{\sigma}^2_j$, where $\boldsymbol{\sigma}^2_j$ is the variance vector of category $j$. Thus, $\boldsymbol{\sigma}_j$ is the standard deviation vector, and $\boldsymbol{\sigma}_j^{l}$ represents the $l$-th component of that vector. This is equivalent to decoupling the components of $\boldsymbol{x}$, assuming each component is independent. Equation $(2)$ then becomes:
\[N(\boldsymbol{x};\boldsymbol{\mu}_j,\boldsymbol{\sigma}^2_j)=\prod_{l=1}^d\frac{1}{\sqrt{2\pi}\sigma_j^{l}}\exp\left(-\frac{1}{2(\sigma_j^{l})^2}(x^{l}-\mu_j^{l})^2\right)\tag{10}\]
The iteration process is also simplified:
\begin{equation}
\text{EM Algorithm 2}:\left\{\begin{aligned}
&p(j|\boldsymbol{x}_i) \leftarrow \frac{\pi_j N(\boldsymbol{x}_i;\boldsymbol{\mu}_j,\boldsymbol{\sigma}^2_j)}{\sum_{j=1}^k\pi_j N(\boldsymbol{x}_i;\boldsymbol{\mu}_j,\boldsymbol{\sigma}^2_j)}\\
&\boldsymbol{\mu}_j \leftarrow \frac{1}{\sum_{i=1}^n p(j|\boldsymbol{x}_i)}\sum_{i=1}^n p(j|\boldsymbol{x}_i)\boldsymbol{x}_i\\
&\boldsymbol{\sigma}^2_j \leftarrow \frac{1}{\sum_{i=1}^n p(j|\boldsymbol{x}_i)}\sum_{i=1}^n p(j|\boldsymbol{x}_i)(\boldsymbol{x}_i-\boldsymbol{\mu}_j)^2\,[\text{element-wise square}]\\
&\pi_j \leftarrow \frac{1}{n}\sum_{i=1}^n p(j|\boldsymbol{x}_i)
\end{aligned}\right.
\end{equation}
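For readers who prefer code, below is a minimal NumPy sketch of EM Algorithm 2 (written for this post; it is not from the paper). EM Algorithm 1 would differ only in using full covariance matrices in the density and in the $\boldsymbol{\Sigma}_j$ update.

```python
import numpy as np

def gmm_em_diag(X, k, iters=50, eps=1e-8):
    """Minimal EM for a diagonal-covariance GMM (EM Algorithm 2 above).
    X: (n, d) data. Returns mixing weights pi, means mu, variances var,
    and responsibilities R with R[i, j] = p(j | x_i)."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    mu = X[rng.choice(n, k, replace=False)]      # (k, d) initial means
    var = np.ones((k, d))                        # (k, d) initial variances
    pi = np.full(k, 1.0 / k)                     # (k,)  mixing weights

    for _ in range(iters):
        # E-step: responsibilities p(j|x_i), in log space for numerical stability
        log_N = -0.5 * (np.log(2 * np.pi * var)[None]
                        + (X[:, None] - mu[None]) ** 2 / var[None]).sum(-1)
        log_post = np.log(pi)[None] + log_N
        log_post -= log_post.max(axis=1, keepdims=True)
        R = np.exp(log_post)
        R /= R.sum(axis=1, keepdims=True)        # (n, k), rows sum to 1

        # M-step: weighted means, variances, and mixing weights
        Nj = R.sum(axis=0) + eps                 # (k,)
        mu = (R.T @ X) / Nj[:, None]
        var = (R.T @ X ** 2) / Nj[:, None] - mu ** 2 + eps
        pi = Nj / n
    return pi, mu, var, R

# toy usage: two well-separated blobs
X = np.vstack([np.random.randn(100, 2) + 3, np.random.randn(100, 2) - 3])
pi, mu, var, R = gmm_em_diag(X, k=2)
print(np.round(pi, 2), np.round(mu, 2))          # roughly [0.5 0.5] and centers near +-3
```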
Even Simpler
If all $\sigma_{j}^{l}$ take the same constant $\sigma$, we get:
\[N(\boldsymbol{x};\boldsymbol{\mu}_j,\sigma)=\frac{1}{(2\pi)^{d/2}\sigma^d}\exp\left(-\frac{1}{2\sigma^2}\Vert \boldsymbol{x}-\boldsymbol{\mu}_j\Vert^2\right)\tag{11}\]
Thus, the entire distribution becomes even simpler. Interestingly, the squared Euclidean distance now appears in the exponent.
In the extreme case $\sigma \to 0$, the exponent tends to negative infinity. It is not hard to show that for each $\boldsymbol{x}_i$, the term $N(\boldsymbol{x}_i;\boldsymbol{\mu}_j,\sigma)$ with the smallest $\Vert \boldsymbol{x}_i-\boldsymbol{\mu}_j\Vert^2$ dominates. By equation $(4)$, $p(j|\boldsymbol{x}_i)$ is then either 0 or 1 (it is 1 for the $j$ that minimizes $\Vert \boldsymbol{x}_i-\boldsymbol{\mu}_j\Vert^2$, and 0 otherwise). In other words, each point belongs only to the cluster center closest to it, which is exactly K-Means with Euclidean distance. Therefore, K-Means based on Euclidean distance can be seen as a limit of GMM.
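A few lines of NumPy make this limit visible: for equal-weight components, the responsibilities in $(4)$ reduce to a softmax of $-\Vert\boldsymbol{x}-\boldsymbol{\mu}_j\Vert^2/(2\sigma^2)$, which collapses to a one-hot assignment as $\sigma$ shrinks (the numbers below are my own toy example):

```python
import numpy as np

x = np.array([0.9, 0.1])                                 # one sample
mus = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])    # three cluster centers
d2 = ((x - mus) ** 2).sum(axis=1)                        # squared Euclidean distances

for sigma in (1.0, 0.3, 0.05):
    logits = -d2 / (2 * sigma ** 2)
    p = np.exp(logits - logits.max())                    # p(j|x) with equal pi_j
    p /= p.sum()
    print(sigma, np.round(p, 3))                         # tends to [1. 0. 0.]: nearest center wins
```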
The New Version of Routing
Getting back to the point, let's talk about Capsules. We said that the GMM algorithm completed the clustering process in "Matrix Capsules with EM Routing." Now let's look at how it was done in detail.
Matrix -> Vector
I must say, the notation in the new paper is a total mess; perhaps only true masters can see the truth within a pile of chaotic symbols. Here, drawing on some online explainer articles and my own reading, I offer my understanding.
First, we use a matrix $\boldsymbol{P}_i$ to represent a Capsule of layer $l$; this layer has $n$ Capsules in total, i.e., $i=1, \dots, n$. We use a matrix $\boldsymbol{M}_j$ to represent a Capsule of layer $l+1$; this layer has $k$ Capsules in total, corresponding to the $k$ categories into which the layer-$l$ Capsules are clustered, $j=1, \dots, k$. The Capsule matrices in the paper are $4 \times 4$ and are called "Pose Matrices." Then the GMM process can begin. When running GMM, the matrices are simply treated as vectors again, so in EM routing $\boldsymbol{P}_i$ acts as a vector with $d=16$. The whole process uses the simplified GMM that assumes a diagonal covariance matrix.
So according to our previous discussion, we get the new dynamic routing algorithm:
\begin{equation}
\text{New Dynamic Routing 1}:\left\{\begin{aligned}
&p_{ij} \leftarrow N(\boldsymbol{P}_i;\boldsymbol{\mu}_j,\boldsymbol{\sigma}^2_j)\\
&R_{ij} \leftarrow \frac{\pi_j p_{ij} }{\sum_{j=1}^k\pi_j p_{ij} },\,\,r_{ij}\leftarrow \frac{R_{ij}}{\sum_{i=1}^n R_{ij}}\\
&\boldsymbol{M}_j \leftarrow \sum_{i=1}^n r_{ij}\boldsymbol{P}_i\\
&\boldsymbol{\sigma}^2_j \leftarrow \sum_{i=1}^n r_{ij}(\boldsymbol{P}_i-\boldsymbol{M}_j)^2\\
&\pi_j \leftarrow \frac{1}{n}\sum_{i=1}^n R_{ij}
\end{aligned}\right.
\end{equation}
Here I denoted $p_{ij}=N(\boldsymbol{P}_i;\boldsymbol{\mu}_j,\boldsymbol{\sigma}^2_j)$ and $R_{ij}=p(j|\boldsymbol{P}_i)$, keeping the notation as consistent with the original paper as possible for easier comparison. The idea of dynamic routing here is the same as in "Dynamic Routing Between Capsules": the layer $l+1$ Capsules are taken as the clustering centers of the layer $l$ Capsules, just with a different clustering method.
Activation Values
In the paper "Dynamic Routing Between Capsules," the length of the vector was used to represent the significance of the feature. Can we still do that here? The answer is no. Because we used GMM for clustering, GMM is based on weighted Euclidean distance (essentially still Euclidean distance). A feature of using Euclidean distance for clustering is that the cluster center vector is the (weighted) average of the vectors within the category (as seen from the iteration formula for $\boldsymbol{M}_j$). Since it is an average, it cannot reflect the "more followers, more power" characteristic, as already mentioned in "Another New Year's Feast: From K-Means to Capsule."
Since the length of the Capsule can no longer measure the significance of the feature, we must add a scalar $a$ to represent the significance of the Capsule. Therefore, the Capsules in this paper are actually "one matrix + one scalar." This scalar is called the "activation value" in the paper, as shown in the figure:

This version of Capsule is "Matrix + Scalar"
As the degree of significance for a Capsule, the most direct choice for $a_j$ should be $\pi_j$, because the $l+1$ layer Capsule is the cluster center, and $\pi_j$ represents the probability of this category. However, we cannot choose $\pi_j$ for two reasons:
1. $\pi_j$ is normalized, but we hope to obtain the significance of the feature itself, not its relative significance compared to other features (more intuitively, we want to perform multiple binary classifications, not one multi-class classification, so global normalization is not required).
2. $\pi_j$ indeed reflects the number of "followers" in the category, but numbers do not necessarily mean strength; there must also be unity.
How should this activation value be chosen? The formula given in the paper is:
\[a_j=logistic\left(\lambda\left(\beta_a-\beta_u\sum_i R_{ij}-\sum_h cost_j^h\right)\right)\tag{12}\]
I believe many readers, after seeing this formula and the "derivation" in the paper, still won't know what it's about. In fact, this formula has a very beautiful source—Information Entropy.
Now we use GMM for clustering. The result is obtaining a probability distribution $p(\boldsymbol{X}|j)$ to describe a category. Then the "degree of uncertainty" of this category can measure the "degree of unity" of this category. To put it more bluntly: the greater the "uncertainty" (meaning closer to a uniform distribution), the more it indicates that the category may still be in a volatile, fragmented era, where the activation value should be lower. The smaller the "uncertainty" (meaning the distribution is more concentrated), the more it indicates that the category is united and modernized, where the activation value should be higher.
Therefore, uncertainty can be used to describe this activation value, and we know that uncertainty is measured by information entropy. So we write:
\[\begin{aligned}S_j =& - \int p(\boldsymbol{x}|j)\ln p(\boldsymbol{x}|j)d\boldsymbol{x}\\
=& - \frac{1}{p(j)}\int p(j|\boldsymbol{x})p(\boldsymbol{x})\ln p(\boldsymbol{x}|j)d\boldsymbol{x}\\
=& - \frac{1}{p(j)} E\left[p(j|\boldsymbol{x})\ln p(\boldsymbol{x}|j)\right]\\
=&-\frac{1}{n\pi_j}\sum_{i=1}^n R_{ij}\ln p_{ij}\\
=&-\frac{1}{\sum_{i=1}^n R_{ij}}\sum_{i=1}^n R_{ij}\ln p_{ij}\\
=&-\sum_{i=1}^n r_{ij}\ln p_{ij}\end{aligned}\tag{13}\]
This is the $cost_j^h$ in the paper, so the $cost$ in the paper is entropy—what a clear and intuitive meaning! And the smaller the entropy, the better—what a natural logic! (Why not directly integrate to calculate the entropy of the normal distribution, but go around like this? Because integration yields a theoretical result; here we want to calculate an actual result about this batch of data based on the data itself.)
Note: If readers cannot understand sampling calculations well, please read the section "Numerical vs Sampling Calculation" in "Variational Autoencoder (II): From a Bayesian Perspective."
After simplification, the result is (the original paper's calculation result should be incorrect):
\[S_j = \frac{d}{2}+\left(\sum_{l=1}^d \ln \boldsymbol{\sigma}_j^l+\frac{d}{2}\ln (2\pi)\right)\sum_i r_{ij}\tag{14}\]
Supplementary Derivation: The assumption here is that $p_{ij}$ corresponds to a $d$-dimensional independent normal distribution, so we only need to find the entropy of each dimension and sum them up:
\[\begin{aligned}S_j^l =& \sum_{i} - r_{ij} \ln p_{ij}^l\\
=&\sum_{i} - r_{ij} \ln \left[\frac{1}{\sqrt{2\pi}\boldsymbol{\sigma}_j^l}\exp\left(-\frac{(\boldsymbol{x}_i^l-\boldsymbol{\mu}_{j}^l)^2}{2(\boldsymbol{\sigma}_{j}^l)^2}\right)\right]\\
=&\left(\frac{1}{2}\ln 2\pi + \ln\boldsymbol{\sigma}_j^l\right)\sum_{i} r_{ij}+\frac{\sum_{i}r_{ij}\left(\boldsymbol{x}_i^l-\boldsymbol{\mu}_{j}^l\right)^2}{2(\boldsymbol{\sigma}_{j}^l)^2}
\end{aligned}\]
Notice that the numerator of the last term is exactly the weighted-variance formula from the M-step, and the denominator is twice that variance, so the last term equals $1/2$:
\[S_j^l = \sum_{i} - r_{ij} \ln p_{ij}^l=\left(\frac{1}{2}\ln 2\pi + \ln\boldsymbol{\sigma}_j^l\right)\sum_{i} r_{ij}+\frac{1}{2}\]
Thus:
\[S_j = \sum_{l=1}^d S_j^l =\left(\frac{d}{2}\ln 2\pi + \sum_{l=1}^d\ln\boldsymbol{\sigma}_j^l\right)\sum_{i} r_{ij}+\frac{d}{2}\]
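As a quick numerical sanity check of this closed form (my own toy setup: random data, arbitrary weights $r_i$ summing to 1, and $\boldsymbol{\mu}, \boldsymbol{\sigma}^2$ computed exactly as in the M-step), the sampled entropy $-\sum_i r_i \ln p_i$ matches the formula above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 4
x = rng.normal(size=(n, d))
r = rng.random(n)
r /= r.sum()                                     # weights with sum_i r_i = 1

mu = (r[:, None] * x).sum(0)                     # weighted mean, as in the M-step
var = (r[:, None] * (x - mu) ** 2).sum(0)        # weighted variance, as in the M-step

log_p = (-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)).sum(1)
S_sampled = -(r * log_p).sum()                   # -sum_i r_i ln p_i
S_closed = d / 2 + (np.log(np.sqrt(var)).sum() + d / 2 * np.log(2 * np.pi)) * r.sum()
print(S_sampled, S_closed)                       # agree up to floating-point error
```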
Since smaller entropy means more significance, we use $-S_j$ to measure the degree of significance, but we want to compress it between 0 and 1. Then we can apply a simple scale transformation and activate it with a sigmoid function:
\[a_j = sigmoid\left( \lambda \left( \beta_a - \left(\beta_u+\sum_{l=1}^d \ln \boldsymbol{\sigma}_j^l \right)\sum_i r_{ij}\right)\right)\tag{15}\]
Equation $(15)$ is basically equivalent to the paper's formula $(12)$: it amounts to a weighted combination of $-S_j$ and $\pi_j$, effectively taking both $-S_j$ (unity) and $\pi_j$ (numbers) into account. $\beta_a, \beta_u$ are optimized through backpropagation, and $\lambda$ is gradually increased during training (an annealing strategy; this is the paper's choice, which I believe is unnecessary). $\beta_a, \beta_u$ might depend on $j$, meaning each upper-layer Capsule could get its own trainable pair $\beta_a, \beta_u$. I say "might" because the paper does not state this clearly; readers can adjust according to their own experiments and needs.
The Routing Appears
With the formula for $a_j$, and since we previously said $a_j$ and $\pi_j$ share some similarities as both are weights for categories, to make the entire routing process more compact, Hinton simply replaced $\pi_j$ with $a_j$. Although this replacement doesn't perfectly match the original GMM iteration process, the meaning is similar and it can converge. Thus, we now obtain the corrected dynamic routing:
\begin{equation}
\text{New Dynamic Routing 2}:\left\{\begin{aligned}
&p_{ij} \leftarrow N(\boldsymbol{P}_i;\boldsymbol{\mu}_j,\boldsymbol{\sigma}^2_j)\\
&R_{ij} \leftarrow \frac{a_j p_{ij} }{\sum_{j=1}^k a_j p_{ij} },\,\,r_{ij}\leftarrow \frac{R_{ij}}{\sum_{i=1}^n R_{ij}}\\
&\boldsymbol{M}_j \leftarrow \sum_{i=1}^n r_{ij}\boldsymbol{P}_i\\
&\boldsymbol{\sigma}^2_j \leftarrow \sum_{i=1}^n r_{ij}(\boldsymbol{P}_i-\boldsymbol{M}_j)^2\\
& cost_j \leftarrow \left(\beta_u+\sum_{l=1}^d \ln \boldsymbol{\sigma}_j^l \right)\sum_i r_{ij} \\
& a_j \leftarrow sigmoid\left( \lambda \left(\beta_a - cost_j\right)\right)
\end{aligned}\right.
\end{equation}
Hold on! We are very close to the end. Now this algorithm basically has no problems, but you will notice that if multiple layers of Capsules are connected, we missed a detail: $\boldsymbol{P}_i$ is the Capsule matrix from the previous layer, which we already used, but what about the activation value of the previous layer (denoted as $a^{last}_i$)? Don't forget that in Matrix Capsules, every Capsule is a matrix + scalar. Using only the matrix is incomplete. Hinton inserted the activation value into the $r_{ij}\leftarrow \frac{R_{ij}}{\sum_{i=1}^n R_{ij}}$ step, so:
\begin{equation}
\text{New Dynamic Routing 3}:\left\{\begin{aligned}
&p_{ij} \leftarrow N(\boldsymbol{P}_i;\boldsymbol{\mu}_j,\boldsymbol{\sigma}^2_j)\\
&R_{ij} \leftarrow \frac{a_j p_{ij} }{\sum_{j=1}^k a_j p_{ij} },\,\,r_{ij}\leftarrow \frac{a^{last}_i R_{ij}}{\sum_{i=1}^n a^{last}_i R_{ij}}\\
&\boldsymbol{M}_j \leftarrow \sum_{i=1}^n r_{ij}\boldsymbol{P}_i\\
&\boldsymbol{\sigma}^2_j \leftarrow \sum_{i=1}^n r_{ij}(\boldsymbol{P}_i-\boldsymbol{M}_j)^2\\
& cost_j \leftarrow \left(\beta_u+\sum_{l=1}^d \ln \boldsymbol{\sigma}_j^l \right)\sum_i r_{ij} \\
& a_j \leftarrow sigmoid\left( \lambda \left(\beta_a - cost_j\right)\right)
\end{aligned}\right.
\end{equation}
This should be the new dynamic routing algorithm described in the paper, if I have understood it correctly, because the original paper is truly difficult to read. I suspect many readers have not noticed this last detail. Why? Because the notation used in the paper looks like this:

Terrible choice of notation in the original paper
(Surprised?)
One more thing... readers may notice that according to our definition, we should have $\sum_i r_{ij}=1$. Why not simplify it directly? Actually, I don't know why, but the $cost_j$ in the paper doesn't use $r_{ij}$, but rather $a^{last}_i R_{ij}$! That is:
\begin{equation}
\text{New Dynamic Routing 4}:\left\{\begin{aligned}
&p_{ij} \leftarrow N(\boldsymbol{P}_i;\boldsymbol{\mu}_j,\boldsymbol{\sigma}^2_j)\\
&R_{ij} \leftarrow \frac{a_j p_{ij} }{\sum_{j=1}^k a_j p_{ij} },\,\,r_{ij}\leftarrow \frac{a^{last}_i R_{ij}}{\sum_{i=1}^n a^{last}_i R_{ij}}\\
&\boldsymbol{M}_j \leftarrow \sum_{i=1}^n r_{ij}\boldsymbol{P}_i\\
&\boldsymbol{\sigma}^2_j \leftarrow \sum_{i=1}^n r_{ij}(\boldsymbol{P}_i-\boldsymbol{M}_j)^2\\
& cost_j \leftarrow \left(\beta_u+\sum_{l=1}^d \ln \boldsymbol{\sigma}_j^l \right)\sum_i a^{last}_i R_{ij} \\
& a_j \leftarrow sigmoid\left( \lambda \left(\beta_a - cost_j\right)\right)
\end{aligned}\right.
\end{equation}
Alright, I am too tired to complain; Hinton can do as he pleases. Even if one can come up with a strained explanation (perhaps the activation of this layer should be modulated by the activation of the previous layer), it no longer offers any real insight.
Weight Matrices
Finally, just like in the previous article, we pair each set of indices $(i, j)$ with a weight matrix $\boldsymbol{W}_{ij}$ (called a viewpoint-invariant matrix) to obtain the "vote matrix" $\boldsymbol{V}_{ij} = \boldsymbol{P}_i \boldsymbol{W}_{ij}$. Then we perform dynamic routing, obtaining the final dynamic routing algorithm:
\begin{equation}
\text{New Dynamic Routing (Complete)}:\left\{\begin{aligned}
&p_{ij} \leftarrow N(\boldsymbol{V}_{ij};\boldsymbol{\mu}_j,\boldsymbol{\sigma}^2_j)\\
&R_{ij} \leftarrow \frac{a_j p_{ij} }{\sum_{j=1}^k a_j p_{ij} },\,\,r_{ij}\leftarrow \frac{a^{last}_i R_{ij}}{\sum_{i=1}^n a^{last}_i R_{ij}}\\
&\boldsymbol{M}_j \leftarrow \sum_{i=1}^n r_{ij}\boldsymbol{V}_{ij}\\
&\boldsymbol{\sigma}^2_j \leftarrow \sum_{i=1}^n r_{ij}(\boldsymbol{V}_{ij}-\boldsymbol{M}_j)^2\\
& cost_j \leftarrow \left(\beta_u+\sum_{l=1}^d \ln \boldsymbol{\sigma}_j^l \right)\sum_i a^{last}_i R_{ij} \\
& a_j \leftarrow sigmoid\left( \lambda \left(\beta_a - cost_j\right)\right)
\end{aligned}\right.
\end{equation}
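To tie everything together, here is a minimal NumPy sketch of the complete routing above. It reflects my reading of the algorithm, not the authors' code: the uniform initialization of $R_{ij}$, the fixed iteration count, and the untrained scalars $\beta_a, \beta_u, \lambda$ are all assumptions, and $d=16$ corresponds to the flattened $4\times 4$ vote matrices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def em_routing(V, a_last, iters=3, beta_a=0.0, beta_u=0.0, lam=1.0, eps=1e-8):
    """Sketch of the complete dynamic routing above.
    V: votes of shape (n, k, d), each 4x4 vote matrix flattened to d = 16.
    a_last: activations of the lower layer, shape (n,).
    Returns the upper-layer pose matrices M (k, 4, 4) and activations a (k,)."""
    n, k, d = V.shape
    R = np.full((n, k), 1.0 / k)                   # uniform initial responsibilities

    for _ in range(iters):
        # M-step
        aR = a_last[:, None] * R                   # a^last_i R_ij
        sum_aR = aR.sum(axis=0) + eps              # sum_i a^last_i R_ij
        r = aR / sum_aR                            # r_ij, columns sum to 1
        M = np.einsum('ij,ijd->jd', r, V)          # M_j = sum_i r_ij V_ij
        sigma2 = np.einsum('ij,ijd->jd', r, (V - M[None]) ** 2) + eps
        cost = (beta_u + 0.5 * np.log(sigma2).sum(axis=1)) * sum_aR
        a = sigmoid(lam * (beta_a - cost))         # upper-layer activations

        # E-step (log-space for stability): R_ij = a_j p_ij / sum_j a_j p_ij
        log_p = (-0.5 * np.log(2 * np.pi * sigma2)[None]
                 - (V - M[None]) ** 2 / (2 * sigma2[None])).sum(axis=2)
        logits = np.log(a + eps)[None] + log_p
        logits -= logits.max(axis=1, keepdims=True)
        R = np.exp(logits)
        R /= R.sum(axis=1, keepdims=True)

    return M.reshape(k, 4, 4), a

# toy usage: votes V_ij = P_i W_ij from random poses and (hypothetical) weight matrices
n, k = 72, 10
P = np.random.randn(n, 4, 4)                       # lower-layer pose matrices
W = np.random.randn(n, k, 4, 4)                    # one W_ij per (i, j) pair
V = np.einsum('iab,ijbc->ijac', P, W).reshape(n, k, 16)
a_last = np.random.rand(n)
M, a = em_routing(V, a_last)
print(M.shape, a.shape)                            # (10, 4, 4) (10,)
```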
Conclusion
Evaluation
After this analysis, it should be clear that the new version of Capsule and its routing algorithm are not complex. The key point of the new paper is the use of GMM to complete the clustering process. GMM is a clustering algorithm based on a probabilistic model; by focusing on this "probabilistic" nature and looking for probability-related quantities, it is not difficult to see where the expression for $a_j$ comes from, which should be the hardest point in understanding the whole paper. Replacing vectors with matrices appears to be merely a way to reduce computation and parameter count, with no fundamental change.
However, the new paper inherits the obscure and difficult-to-understand expression of the old paper, combined with chaotic notation, which greatly increases the difficulty of our understanding. Once again, I criticize the authors' writing style.
Of course, there are still some things in the paper that are difficult to understand or whose thought process remains unclear, such as the replacement of $r_{ij}$ with $a^{last}_i R_{ij}$ in $cost_j$ mentioned above. If anyone has more insightful ideas, feel free to leave a comment for discussion.
Thoughts
So far, I have finally managed to clarify "Matrix Capsules with EM Routing." As for the code, I won't write it, as I personally don't particularly like this new Capsule and dynamic routing and don't want to reinvent the wheel.
This is my third article on understanding Capsules. Compared to other articles in Scientific Spaces, these three have "massive" lengths, carrying my thoughts and understanding of Capsules. Each article took several days to write, attempting to combine theory and popular language as much as possible, and trying to sort out the causes and effects clearly. I hope these words can help readers understand Capsules faster. Of course, the author's level is limited; if there are any misleading points, please feel free to criticize and comment.
Naturally, I hope the authors of Capsule can introduce their new theory in more intuitive and inspiring language, which would save us popularizers a lot of work. After all, Capsule may indeed be the future of deep learning—how can it be so vague?