Uncovering the Mist: A Delicious Capsule Feast

By 苏剑林 | January 23, 2018

[Image: Geoffrey Hinton at Google's Toronto office]

The Capsule paper "Dynamic Routing Between Capsules", released by deep learning pioneer Geoffrey Hinton, was undoubtedly one of the hottest topics in the deep learning community last year. Thanks to breathless media coverage, Capsules have been shrouded in mystery, with phrases like "abandoning gradient descent" and "overthrowing deep learning" appearing everywhere, while others dismiss Capsule as nothing more than a new round of hype.

This article attempts to lift this mist of confusion, grasp the principles and appeal behind Capsules, and enjoy this "Capsule Feast." I have also added an experiment of my own design, which I believe demonstrates the effectiveness of Capsules more convincingly than the experiments in the original paper.

The Menu:

1. What is a Capsule?
2. Why do it this way?
3. Is Capsule truly good?
4. What do I think of Capsules?
5. A few side dishes.

Preface

The Capsule paper has been out for several months now, and many experts have provided interpretations and open-source implementations of CapsuleNet. These resources have accelerated my understanding. However, I find that most online interpretations are merely polished translations of the paper, lacking an explanation of the underlying principles. For example, regarding the "Dynamic Routing" part, most basically copy the algorithm from the paper and mention that it converges in three iterations. But what is it converging to? Neither the paper nor the interpretations explain this, which is clearly unsatisfactory. No wonder a reader on Zhihu commented:

The so-called Capsule is just another fancy "trick" concept contributed to DL. I call it a trick because Hinton didn't explain why the routing algorithm needs those specific steps, cycles within cycles—is there any theoretical basis? Or was it just cobbled together?

This comment might be extreme, but it hits close to home: why should we blindly follow a set of steps that Hinton presented without explanation?

The Capsule Feast

Banquet Specialties

The specialty of this Capsule feast is "vector in, vector out," replacing the traditional "scalar in, scalar out." This means the input and output of neurons have become vectors, which is framed as a revolution in neural network theory. But is it really? Haven't we done "vector in, vector out" tasks before? We have, and plenty of them! In NLP, a sequence of word vectors can be seen as "vector in." After encoding by RNN/CNN/Attention, the output is a new sequence, which is "vector out." Modern deep learning is full of "vector in, vector out" cases, so this alone isn't the revolution of Capsules.

The revolution of Capsule lies in: It proposes a new transmission scheme for "vector in, vector out," and this scheme is largely interpretable.

If asked why deep learning (neural networks) is effective, I usually answer: neural networks achieve layer-by-layer abstraction of the input by stacking layers. This process simulates, to some extent, the hierarchical way humans categorize things, which lets the network reach the target output while generalizing well. Neural networks ought to work this way, yet we cannot prove that they strictly do so. This lack of interpretability is why many view deep learning as a "black box."

Let's see how Hinton uses Capsules to break through this.

The "Big Pot Dish" (Grand Casserole)

If I were to use a dish to describe Capsules, I'd think of the Hakka "Big Pot Dish" (Pencai):

Pencai has a long history in Hakka cuisine. A large basin is filled with ingredients stacked layer upon layer so that their flavors blend; the ingredients that soak up the juices most readily sit at the bottom. As you eat down through the layers, the juices merge and the flavor builds layer by layer, growing richer and more fragrant as you go.

Capsules are designed for this "layer-by-layer progression" goal. To be honest, the writing style of the Capsule paper leaves much to be desired, so I will try not to use the exact same symbols as the paper to avoid confusing readers. Let's look at a diagram.

capsule diagram

As shown, lower-level capsules and higher-level capsules form specific connection relationships. Wait, what is a "capsule"? Simply put, if you treat a vector as a single unit, it is a "capsule." Yes, you read that right. You can understand it like this: A neuron is a scalar; a capsule is a vector. It's that blunt! Hinton's interpretation is: Each capsule represents an attribute, and the vector of the capsule represents the "frame" (instantiation parameters) of that attribute. That is, instead of just using a scalar to say if a feature exists (e.g., whether there are feathers), we use a vector to represent not just "if," but "what kind" (e.g., color, texture of feathers). This provides a richer expression of individual features.

This reminds me of word vectors in NLP. Previously, we used one-hot encoding to represent a word, simply indicating its presence. Now we use word vectors, which are richer—not only indicating presence but also semantic similarity. Are word vectors the "capsules" of NLP? The analogy might be a bit forced, but the gist is correct.

How do these capsules operate to embody the characteristics of "layer-by-layer abstraction" and "layer-by-layer classification"? Let's look at a subset of the connections:

individual capsule connections

The diagram only shows the connections for $\boldsymbol{u}_1$. This means we already have the feature $\boldsymbol{u}_1$ (say, feathers), and I want to know which higher-level feature $\boldsymbol{v}_1, \boldsymbol{v}_2, \boldsymbol{v}_3, \boldsymbol{v}_4$ (say, chicken, duck, fish, dog) it belongs to. We are familiar with classification: isn't it just an inner product followed by softmax? Thus, based solely on feature $\boldsymbol{u}_1$, we derive the probabilities of it belonging to chicken, duck, fish, or dog as: $$\big(p_{1|1},p_{2|1},p_{3|1},p_{4|1}\big) = \frac{1}{Z_1}\Big(e^{\langle\boldsymbol{u}_1,\boldsymbol{v}_1\rangle}, e^{\langle\boldsymbol{u}_1,\boldsymbol{v}_2\rangle}, e^{\langle\boldsymbol{u}_1,\boldsymbol{v}_3\rangle}, e^{\langle\boldsymbol{u}_1,\boldsymbol{v}_4\rangle}\Big)\tag{1}$$ We naturally expect $p_{1|1}$ and $p_{2|1}$ to be significantly larger than $p_{3|1}$ and $p_{4|1}$. However, a single feature isn't enough; we need to synthesize various features. So, we repeat this for every $\boldsymbol{u}_i$, obtaining $\big(p_{1|2},p_{2|2},p_{3|2},p_{4|2}\big)$, $\big(p_{1|3},p_{2|3},p_{3|3},p_{4|3}\big)$, etc.

The question is: with so many predictions, which one do I choose? And I'm not actually doing a final classification; I want to merge these features to form higher-level features. Hinton believes that since feature $\boldsymbol{u}_i$ yields a probability distribution $\big(p_{1|i},p_{2|i},p_{3|i},p_{4|i}\big)$, I can split this feature into four parts: $\big(p_{1|i}\boldsymbol{u}_i, p_{2|i}\boldsymbol{u}_i, p_{3|i}\boldsymbol{u}_i, p_{4|i}\boldsymbol{u}_i\big)$. Then, I transmit these parts to $\boldsymbol{v}_1, \boldsymbol{v}_2, \boldsymbol{v}_3, \boldsymbol{v}_4$ respectively. Finally, $\boldsymbol{v}_1, \boldsymbol{v}_2, \boldsymbol{v}_3, \boldsymbol{v}_4$ are simply the accumulations of the features passed up from all the lower levels: $$\boldsymbol{v}_j = squash\left(\sum_{i} p_{j|i} \boldsymbol{u}_i\right) = squash\left(\sum_{i} \frac{e^{\langle\boldsymbol{u}_i,\boldsymbol{v}_j\rangle}}{Z_i} \boldsymbol{u}_i\right)\tag{2}$$ Viewed from the top down, a Capsule layer lets each lower-level feature perform its own classification and then integrates the results; each $\boldsymbol{v}_j$ should end up as close as possible to the $\boldsymbol{u}_i$ that voted for it, where closeness is measured by the inner product. Viewed from the bottom up, $\boldsymbol{v}_j$ is essentially a cluster center of the various $\boldsymbol{u}_i$. The core idea of Capsules is that the output is a kind of clustering of the input.
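As a sanity check on Equation $(1)$: the "vote" of a single low-level capsule is nothing but an inner-product softmax. A minimal NumPy toy (all shapes and numbers are made up for illustration):

import numpy as np

# Equation (1): how strongly does one low-level feature u_1 "vote" for each v_j?
u1 = np.random.randn(8)                      # one low-level capsule, dimension 8
V = np.random.randn(4, 8)                    # hypothetical v_1..v_4 (chicken, duck, fish, dog)
logits = V.dot(u1)                           # inner products <u_1, v_j>
p = np.exp(logits) / np.exp(logits).sum()    # (p_{1|1}, p_{2|1}, p_{3|1}, p_{4|1})
print(p)                                     # a probability distribution over the 4 upper capsules (sums to 1)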

Now, let's see what this $squash$ thing is.

Concentrated Juice (Squash)

The term "squash" can refer to a concentrated fruit drink; let's taste it as such. This drink exists because Hinton wanted Capsules to have a specific property: the length (norm) of the capsule vector represents the probability of that feature existing.

Actually, I don't like the term "probability" here, because it reminds us of normalization, which is quite troublesome. I think "significance of the feature" is better: the larger the norm, the more significant the feature. We want a bounded index to measure this significance, so we must compress the norm; hence, "concentration is the essence." Hinton's chosen compression scheme is: $$squash(\boldsymbol{x})=\frac{\Vert\boldsymbol{x}\Vert^2}{1+\Vert\boldsymbol{x}\Vert^2}\frac{\boldsymbol{x}}{\Vert\boldsymbol{x}\Vert}\tag{3}$$ Here $\boldsymbol{x}/\Vert\boldsymbol{x}\Vert$ is easy to understand: it scales the vector to unit length. But how should we understand the first factor, and why this particular choice? In fact, there are many ways to compress the norm into the 0-1 range, such as: $$\tanh \Vert\boldsymbol{x}\Vert, \quad 1-e^{-\Vert\boldsymbol{x}\Vert}, \quad \frac{\Vert\boldsymbol{x}\Vert}{1+\Vert\boldsymbol{x}\Vert}$$ I'm not sure why Hinton settled on this particular form; perhaps every such scheme is worth exploring. In some experiments, I found that choosing $$squash(\boldsymbol{x})=\frac{\Vert\boldsymbol{x}\Vert^2}{0.5+\Vert\boldsymbol{x}\Vert^2}\frac{\boldsymbol{x}}{\Vert\boldsymbol{x}\Vert}$$ works slightly better. Compared with the original, this variant gives vectors with small norms a boost rather than compressing everything.
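Both the paper's squash and the 0.5 variant are one-liners. A minimal NumPy sketch (the small eps only guards against dividing by zero for an all-zero vector):

import numpy as np

def squash(x, b=1.0, axis=-1, eps=1e-9):
    # b=1.0 reproduces Equation (3); b=0.5 gives the variant discussed above
    s2 = np.sum(x ** 2, axis=axis, keepdims=True)      # ||x||^2
    return (s2 / (b + s2)) * x / np.sqrt(s2 + eps)     # compressed norm * unit direction

x = np.array([0.3, 0.4])                      # ||x|| = 0.5
print(np.linalg.norm(squash(x)))              # 0.2    (= 0.25 / 1.25)
print(np.linalg.norm(squash(x, b=0.5)))       # ~0.333 (= 0.25 / 0.75), larger for small norms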

However, a question worth considering: Is this compression necessary for middle layers? Since dynamic routing exists, the network already possesses non-linearity even without the $squash$ function. Therefore, it might not be strictly necessary to compress features to 0-1 in every layer, just as standard neural networks don't always use sigmoid to compress every layer's output. This needs verification in practice.

Dynamic Routing

Notice Equation $(2)$: to calculate $\boldsymbol{v}_j$ you need the softmax, but to calculate the softmax you need to know $\boldsymbol{v}_j$. Isn't this a "chicken and egg" problem? This is where the "Main Course" comes in: "Dynamic Routing." It lets (some of) the coefficients update themselves through iteration, which partly realizes Hinton's goal of moving away from gradient descent, at least for the routing step.

How was this "Main Course" conceived? Where does it converge? Let's serve two side dishes first, then savor the main course.

Side Dish 1

Let's go back to ordinary neural networks. As everyone knows, activation functions are crucial. Of course, the functions themselves are simple; for example, a tanh-activated fully connected layer in TensorFlow looks like this:

# a fully connected layer with tanh activation (W, x, b defined elsewhere)
y = tf.matmul(W, x) + b
y = tf.tanh(y)
But what if I want to use the inverse of $x = y + \cos y$ as the activation? That is, given $x$, you must first solve the equation for $y$, and then use that $y$ as the activation output.

Mathematicians tell us that this inverse is a transcendental function that cannot be written in closed form in terms of elementary functions. Does that mean we're stuck? No, we have iteration: $$y_{n+1}=x-\cos y_n$$ By choosing $y_0 = x$ and iterating a few times, we get a very accurate $y$. If we iterate three times: $$y=x-\cos\big(x-\cos(x-\cos x)\big)$$ In TensorFlow, this would be:

y = tf.matmul(W, x) + b
# solve Y = y - cos(Y) by fixed-point iteration, starting from Y = y
Y = y
for i in range(3):
    Y = y - tf.cos(Y)
If you have already skimmed the Capsule paper, you will find this very similar to the Dynamic Routing process.

Side Dish 2

Consider another example, which might have many counterparts in NLP but also occurs in computer vision. Consider a vector sequence $(\boldsymbol{x}_1,\boldsymbol{x}_2,\dots,\boldsymbol{x}_n)$. I want to find a way to integrate these $n$ vectors into a single vector $\boldsymbol{x}$ (encoder) to use for classification.

One might think of LSTM. But here, I only want to represent it as a linear combination of the original vectors: $$\boldsymbol{x}=\sum_{i=1}^{n} \lambda_i \boldsymbol{x}_i$$ Here, $\lambda_i$ measures the similarity between $\boldsymbol{x}$ and $\boldsymbol{x}_i$. But wait—how can we determine similarity before $\boldsymbol{x}$ even exists? Another chicken and egg problem. The solution is also iteration. We can define a similarity metric based on softmax, then set: $$\boldsymbol{x}=\sum_{i=1}^{n} \frac{e^{\langle\boldsymbol{x},\boldsymbol{x}_i\rangle}}{Z} \boldsymbol{x}_i$$ Initially, we know nothing, so we take $\boldsymbol{x}$ as the mean of all $\boldsymbol{x}_i$. Substitute this into the right side to get a new $\boldsymbol{x}$, and repeat. Typically, it converges in a few iterations. This iterative process can be embedded into a neural network.
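Here is a minimal NumPy sketch of this iteration (toy shapes; in practice the loop would be embedded inside a network, as described):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def iterative_pool(X, r=3):
    # X: (n, dim) vector sequence; returns a single vector x = sum_i lambda_i x_i
    x = X.mean(axis=0)                 # start from the plain average
    for _ in range(r):
        lam = softmax(X.dot(x))        # lambda_i proportional to exp(<x, x_i>)
        x = lam.dot(X)                 # re-estimate x as the weighted sum
    return x

X = np.random.randn(5, 16)             # e.g. 5 word vectors of dimension 16
print(iterative_pool(X).shape)          # (16,)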

If Side Dish 1 shares a common spirit with dynamic routing, Side Dish 2 is practically a sibling. I haven't seen existing work doing exactly this, but it serves as a good mental model.

Serving the Main Course!

With these two side dishes, the mystery of Dynamic Routing vanishes. To obtain the $\boldsymbol{v}_j$, assume they all initially equal the mean of the $\boldsymbol{u}_i$, then iterate. Simply put, the output is a clustering result of the input, and since clustering usually requires an iterative algorithm, this iteration is called "Dynamic Routing." As for details, they aren't fixed; they depend on the clustering algorithm. For instance, the newer Capsule paper "MATRIX CAPSULES WITH EM ROUTING" uses Gaussian Mixture Models for clustering.

With this understanding, we can write the Dynamic Routing algorithm used here:

Dynamic Routing Algorithm

Initialize $b_{ij}=0$
Iterate $r$ times:
$\quad \boldsymbol{c}_i \leftarrow softmax(\boldsymbol{b}_i)$;
$\quad \boldsymbol{s}_j \leftarrow \sum_{i} c_{ij} \boldsymbol{u}_i$;
$\quad \boldsymbol{v}_j \leftarrow squash(\boldsymbol{s}_j)$;
$\quad b_{ij} \leftarrow \langle\boldsymbol{u}_i,\boldsymbol{v}_j\rangle$.
Return $\boldsymbol{v}_j$.

Here, $c_{ij}$ is the $p_{j|i}$ mentioned earlier.

"Hey, I caught a mistake! I read the paper, and it should be $b_{ij} \leftarrow b_{ij} + \langle\boldsymbol{u}_i,\boldsymbol{v}_j\rangle$, not $b_{ij} \leftarrow \langle\boldsymbol{u}_i,\boldsymbol{v}_j\rangle$!"

In fact, the algorithm above is NOT wrong—if you accept the derivation in this article and Equation $(2)$, then the iteration process above is correct.

"Is Hinton wrong? Who are you to challenge Hinton?" Hold on, let's analyze what happens with Hinton's version. If we follow Hinton's algorithm, $b_{ij} \leftarrow b_{ij} + \langle\boldsymbol{u}_i,\boldsymbol{v}_j\rangle$, after $r$ iterations, it becomes: $$\boldsymbol{v}_j^{(r)}=squash\left(\sum_i\frac{e^{\big\langle\boldsymbol{u}_{i},\,\boldsymbol{v}_{j}^{(0)}+\boldsymbol{v}_{j}^{(1)}+\dots+\boldsymbol{v}_{j}^{(r-1)}\big\rangle}}{Z_i}\boldsymbol{u}_{i}\right)$$ Since $\boldsymbol{v}_j^{(r)}$ will eventually approach the true $\boldsymbol{v}_j$, we can write: $$\boldsymbol{v}_j^{(r)}\sim squash\left(\sum_i\frac{e^{r\langle\boldsymbol{u}_{i},\,\boldsymbol{v}_j\rangle}}{Z_i}\boldsymbol{u}_{i}\right)$$ If we were to iterate infinitely (not possible due to compute, but theoretically interesting), as $r \to \infty$, the result of the softmax becomes "winner-take-all" (either 0 or 1). This means each lower-level capsule connects to exactly one higher-level capsule.

Is this reasonable? I don't think so. Different categories can share features—just as cats and dogs are different but have similar eyes. Some explain this by saying $r$ is a hyperparameter that shouldn't be too large to prevent overfitting. I don't know if Hinton shares that view, but I believe that if $r$ is just a tuning hyperparameter, it makes the Capsule theory "ugly."

Dynamic routing is already criticized as "incomprehensible." Adding a counter-intuitive hyperparameter makes it worse. Conversely, if we start from Equation $(2)$, we get the algorithm in this post, which aligns with clustering theory. Theoretically, it's more elegant because then, the larger $r$, the better (limited only by compute); there is no "too large" hyperparameter. In fact, after changing this and running it on open-source Capsule code, I achieved the same results. I'll leave the choice to the reader, but as someone with a bit of "theoretical perfectionism," I can't stand inconsistencies.

Model Details

Below are the details of the Capsule implementation. The corresponding code is on my GitHub (Keras version). Compared with previous implementations, mine is pure Keras and uses K.local_conv1d in place of the K.map_fn used in earlier implementations, which makes it several times faster, because K.map_fn does not parallelize automatically. I also implemented a shared-parameter version using K.conv1d. Environment: Python 2.7 + TensorFlow 1.8 + Keras 2.1.4.

Fully Connected Version

Regardless of whether it's Hinton's version or mine, if $\boldsymbol{v}_j$ can be calculated iteratively, does that mean there are no parameters? Have we truly abandoned backpropagation?

No. If that were the case, all $\boldsymbol{v}_j$ would be identical. As established, $\boldsymbol{v}_j$ is a cluster center of the input $\boldsymbol{u}_i$. To "view the features from different angles," we must multiply each capsule by a transformation matrix before it enters the next layer. So Equation $(2)$ becomes: $$\boldsymbol{v}_j = squash\left(\sum_{i} \frac{e^{\langle\hat{\boldsymbol{u}}_{j|i},\,\boldsymbol{v}_j\rangle}}{Z_i} \hat{\boldsymbol{u}}_{j|i}\right),\quad \hat{\boldsymbol{u}}_{j|i} = \boldsymbol{W}_{ji}\boldsymbol{u}_i\tag{4}$$ where the $\boldsymbol{W}_{ji}$ are the matrices to be trained, and $\boldsymbol{W}_{ji}\boldsymbol{u}_i$ is an ordinary matrix-vector product. The Capsule layer then looks like this:

fully connected capsule

Now we have the full dynamic routing:

Full Dynamic Routing Algorithm

Initialize $b_{ij}=0$
Iterate $r$ times:
$\quad \boldsymbol{c}_i \leftarrow softmax(\boldsymbol{b}_i)$;
$\quad \boldsymbol{s}_j \leftarrow \sum_{i} c_{ij} \hat{\boldsymbol{u}}_{j|i}$;
$\quad \boldsymbol{v}_j \leftarrow squash(\boldsymbol{s}_j)$;
$\quad b_{ij} \leftarrow \langle\hat{\boldsymbol{u}}_{j|i} , \boldsymbol{v}_j\rangle$.
Return $\boldsymbol{v}_j$.
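Putting it all together, here is a NumPy sketch of the fully connected version: Equation $(4)$ plus the routing loop above. Shapes and initialization are made up, and this only illustrates the forward computation, not the Keras layer used later in the experiments:

import numpy as np

def squash(x, axis=-1, eps=1e-9):
    s2 = np.sum(x ** 2, axis=axis, keepdims=True)
    return (s2 / (1.0 + s2)) * x / np.sqrt(s2 + eps)

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def capsule_fc(u, W, r=3):
    # u: (n_in, d_in) lower capsules; W: (n_in, n_out, d_out, d_in) matrices W_ji
    u_hat = np.einsum('ijkl,il->ijk', W, u)       # u_hat[i, j] = W_ji u_i -> (n_in, n_out, d_out)
    b = np.zeros(W.shape[:2])                     # routing logits b_ij
    for _ in range(r):
        c = softmax(b, axis=1)                    # c_i = softmax(b_i) over the upper capsules j
        s = np.einsum('ij,ijk->jk', c, u_hat)     # s_j = sum_i c_ij u_hat_{j|i}
        v = squash(s)                             # v_j, shape (n_out, d_out)
        b = np.einsum('ijk,jk->ij', u_hat, v)     # b_ij = <u_hat_{j|i}, v_j> (this article's variant)
    return v

u = np.random.randn(72, 8)                        # e.g. 72 lower capsules of dimension 8
W = 0.1 * np.random.randn(72, 10, 16, 8)          # 10 upper capsules of dimension 16
print(capsule_fc(u, W).shape)                     # (10, 16)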

This Capsule layer is clearly analogous to a fully connected layer in a standard neural network.

Shared Version

Fully connected layers handle fixed-length inputs. CNNs, however, handle varying image sizes, resulting in varying feature counts. The fully connected Capsule fails here because the number of parameter matrices equals (number of input capsules) $\times$ (number of output capsules). If the input count is unfixed, we can't use a fixed set of weights.

shared capsule parameters

Like weight sharing in CNNs, we need a weight-shared Capsule. "Shared" means that for a fixed upper-level capsule $j$, the transformation matrix used for all connections from lower-level capsules is the same, i.e., $\boldsymbol{W}_{ji} \equiv \boldsymbol{W}_j$.

As shown, the shared version isn't hard to understand. From a bottom-up view, all input vectors are mapped via the same matrix, clustered, and output. Repeating this several times produces several output vectors (capsules). Alternatively, from a top-down view, each transformation matrix is an "identifier" for the higher-level capsule to recognize if the lower-level capsules contain specific features. Naturally, the parameter count of this version doesn't depend on the number of input capsules, so it can easily follow a CNN. For the shared version, Equation $(2)$ becomes: $$\boldsymbol{v}_j = squash\left(\sum_{i} \frac{e^{\langle\hat{\boldsymbol{u}}_{j|i},\boldsymbol{v}_j\rangle}}{Z_i} \hat{\boldsymbol{u}}_{j|i}\right),\quad \hat{\boldsymbol{u}}_{j|i} = \boldsymbol{W}_{j}\boldsymbol{u}_i\tag{5}$$ The dynamic routing algorithm remains unchanged.
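For the shared version, only the computation of $\hat{\boldsymbol{u}}_{j|i}$ changes; a short sketch with the same made-up shapes as before:

import numpy as np

u = np.random.randn(72, 8)                        # 72 lower capsules of dimension 8
W_shared = 0.1 * np.random.randn(10, 16, 8)       # one matrix W_j per upper capsule, shared over i
u_hat = np.einsum('jkl,il->ijk', W_shared, u)     # u_hat[i, j] = W_j u_i -> (72, 10, 16)
# the routing loop is exactly the same as in the capsule_fc sketch above;
# note that the parameter count no longer depends on the number of input capsules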

Backpropagation

Although I'm not a fan of the term, we must use it here.

Now that we have the $\boldsymbol{W}_{ji}$, how do we train them? The answer is backpropagation. If you're confused about how dynamic routing and backpropagation coexist: it's simple. As in "Side Dish 1," the iterations (three of them in the paper) are embedded into the model; formally, it's just like adding three layers to the model. Everything else proceeds as usual: build a loss and backpropagate.

Thus, there is not only backpropagation inside Capsules, but *only* backpropagation, because the dynamic routing has been integrated as part of the model architecture, not as a separate optimization algorithm.

What has been achieved?

It is time to review. What has Capsule done? Simply put, it provides a new "vector in, vector out" scheme, not unlike CNN, RNN, or Attention layers. From Hinton's intent, it provides a new scheme based on clustering to replace pooling for feature integration, with a more powerful feature expression capability.

Experiments

MNIST Classification

Unsurprisingly, Capsules were first tested on MNIST and performed well. By perturbing values inside the capsules to reconstruct images, researchers found these values represented specific physical meanings, proving Capsules achieved their initial goal.

In a Capsule classification model, the final layer outputs 10 vectors (capsules), each representing a class, and the norm of each vector represents the probability of that class. Effectively, Capsule treats multi-class classification as multiple binary classifications. It doesn't use the standard cross-entropy, but rather the margin loss $$L_c = T_c \max(0, m^+ - \Vert \boldsymbol{v}_c\Vert)^2 + \lambda (1 - T_c) \max(0, \Vert \boldsymbol{v}_c\Vert - m^-)^2$$ where $T_c$ is 1 if class $c$ is present and 0 otherwise (the paper uses $m^+=0.9$, $m^-=0.1$ and $\lambda=0.5$). The paper also measured the performance gain from adding a reconstruction network.
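Written as a standalone Keras loss, this is just a few lines (a sketch using the paper's values $m^+=0.9$, $m^-=0.1$, $\lambda=0.5$; the experiment code below inlines the same loss as a lambda with $\lambda=0.25$):

from keras import backend as K

def margin_loss(y_true, y_pred, m_plus=0.9, m_minus=0.1, lam=0.5):
    # y_pred holds the capsule norms ||v_c||; y_true is the multi-hot label T_c
    L = y_true * K.relu(m_plus - y_pred) ** 2 \
        + lam * (1.0 - y_true) * K.relu(y_pred - m_minus) ** 2
    return K.sum(L, axis=-1)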

Overall, the paper's experiments were a bit rough. Using MNIST is a bit weak (at least use Fashion MNIST), and the reconstruction network was just two simple fully connected layers. But the goal was likely just to prove the workflow... so it's acceptable.

My Experiment

Since standard CNNs already reach 99%+ accuracy on MNIST, claiming Capsules work just based on that is unconvincing. I designed a new experiment. It demonstrates Capsules' ability to integrate features. Capsules not only work; they work beautifully.

The experiment is as follows:

1. Using the existing MNIST dataset, train a digit recognition model. But instead of a 10-way softmax, treat it as 10 binary classification problems. This can be done with both CNN+Pooling and CNN+Capsule.
2. After training, test the model. But the test images aren't standard. Instead, take two random images from the test set and stitch them together. See if the model can predict both digits (correct digits regardless of order).
Note the mismatch: the training set is one digit per image, but the test set is two digits per image.

The experiment was done in Keras; code is available on my GitHub. I'll show the core parts here.

First, the CNN. For fairness, both models use the same CNN backbone:

# imports for the snippets below
from keras.layers import Input, Conv2D, AveragePooling2D, GlobalAveragePooling2D, Dense, Lambda
from keras.models import Model
from keras import backend as K
import numpy as np

# CNN part, identical for both models
input_image = Input(shape=(None,None,1))
cnn = Conv2D(64, (3, 3), activation='relu')(input_image)
cnn = Conv2D(64, (3, 3), activation='relu')(cnn)
cnn = AveragePooling2D((2,2))(cnn)
cnn = Conv2D(128, (3, 3), activation='relu')(cnn)
cnn = Conv2D(128, (3, 3), activation='relu')(cnn)

First, modeling with standard Pooling + Fully Connected layers:

cnn = GlobalAveragePooling2D()(cnn)
dense = Dense(128, activation='relu')(cnn)
output = Dense(10, activation='sigmoid')(dense)

model = Model(inputs=input_image, outputs=output)
# margin loss from the Capsule paper, here with m+=0.9, m-=0.1, lambda=0.25
model.compile(loss=lambda y_true,y_pred: y_true*K.relu(0.9-y_pred)**2 + 0.25*(1-y_true)*K.relu(y_pred-0.1)**2,
              optimizer='adam',
              metrics=['accuracy'])

This model has about 270,000 parameters and reaches 99.3%+ accuracy on standard MNIST. Now we test it on the stitched images and report two accuracies: 1) whether the two highest-scoring classes match the two true digits, and 2) the same, but additionally requiring both of those scores to exceed 0.5 (reasonable, since each output is a binary classifier).

# Reorder and stitch the test set so that each image contains two different digits
idx = range(len(x_test))
np.random.shuffle(idx)
X_test = np.concatenate([x_test, x_test[idx]], 1)
Y_test = np.vstack([y_test.argmax(1), y_test[idx].argmax(1)]).T
X_test = X_test[Y_test[:,0] != Y_test[:,1]] # keep only pairs of different digits
Y_test = Y_test[Y_test[:,0] != Y_test[:,1]]
Y_test.sort(axis=1) # sort the label pairs so the order doesn't matter

Y_pred = model.predict(X_test)
greater = np.sort(Y_pred, axis=1)[:,-2] > 0.5 # do both of the top-2 scores exceed 0.5?
Y_pred = Y_pred.argsort()[:,-2:] # indices of the two highest-scoring classes
Y_pred.sort(axis=1)

acc = 1.*(np.prod(Y_pred == Y_test, axis=1)).sum()/len(X_test)
print u'Accuracy (ignoring confidence): %s'%acc
acc = 1.*(np.prod(Y_pred == Y_test, axis=1)*greater).sum()/len(X_test)
print u'Accuracy (considering confidence): %s'%acc

After repeated tests, ignoring confidence gave roughly 40% accuracy, and considering confidence gave roughly 10%. And these were among the better runs; often the numbers were even lower.

Now look at the Capsule's performance. Replace the code after the CNN with:

capsule = Capsule(10, 16, 3, True)(cnn) # 10 output capsules of dimension 16, 3 routing iterations, shared weights
output = Lambda(lambda x: K.sqrt(K.sum(K.square(x), 2)))(capsule) # replace each capsule by its norm, giving 10 scores

model = Model(inputs=input_image, outputs=output)
# same margin loss as before
model.compile(loss=lambda y_true,y_pred: y_true*K.relu(0.9-y_pred)**2 + 0.25*(1-y_true)*K.relu(y_pred-0.1)**2,
              optimizer='adam',
              metrics=['accuracy'])

With the shared-weight Capsule, the model also has roughly 270,000 parameters and also reaches about 99.3% on standard MNIST, so the two models start out comparable.

However, the result is astonishing: on the new "dual-digit" test set, the Capsule model achieves over 90% accuracy for both metrics! Even without specific training for multiple digits, the Capsule identifies the features with high confidence.

Of course, if you trained a regular CNN+Pooling on dual-digit images, it would work well too. But the point is that the old architecture struggles with generalization (transfer ability). Standard CNN+Pooling needs to be "hand-held" for every specific task, while Capsules show an ability to generalize—something we truly desire.

Final Thoughts

It Looks Good

Capsules aim for interpretable neural network solutions. From this perspective, Capsule is a success, at least as a "beta" version. Its goal isn't just winning accuracy benchmarks, but providing an excellent, interpretable representation. My experiment suggests Capsules integrate features more like the human eye than pooling does.

In fact, representing probability via vector norm reminds me of the wave function in quantum mechanics, where the norm squared represents probability. This suggests that future Capsule development could draw inspiration from quantum mechanics.

Needs Optimization

Clearly, there is much to optimize, both theoretically and practically. I believe the "ugliest" part isn't dynamic routing but the $squash$ function. Is compression really necessary for non-output layers? Using the norm to represent probability forces the norm to be less than 1, but when two vectors with norms below 1 are added, the result isn't necessarily below 1, so yet another compression is needed. This feels rather arbitrary. A more elegant solution probably needs tools from manifold analysis, or perhaps the quantum-mechanics analogy.

Practically, Capsules are slow. Embedding the clustering iteration into the network doesn't add much to forward passes, but it explodes the backpropagation cost because gradients of compound functions become highly complex.

Is Backpropagation Bad?

Hinton's likely reason for wanting to discard backpropagation is that it lacks a biological counterpart (the need for exact derivatives).

Actually, I disagree. While exact derivatives are rare in nature, they represent our progress. Without gradients, we optimize via "trial and interpolation"—change parameter $\alpha$ from 3 to 5, see if loss decreases. This is gradient descent in spirit but only adjusts one parameter at a time. With millions of parameters, we'd need millions of trials for every single step. Calculating gradients is a superior trick that allows us to adjust everything at once. Why not use it?

Is Pooling Bad?

Hinton thinks pooling is unscientific, but I think its utility depends on the application. MNIST (28x28) might not need it, but what about a 1000x1000 image? Don't things further away look less detailed? Isn't that a form of pooling?

I think pooling is useful for low-level features, but problematic for high-level information. Modern CNNs often use Global Pooling at the final layer, which destroys spatial transferability (as seen in my experiment). If we strictly avoid pooling, isn't a stride=2 convolution similar to a stride=1 plus 2x2 pooling anyway? In my Capsule experiment, I used pooling with Capsules, and the results didn't worsen.

Conclusion

This is likely the longest single blog post I've written. I hope everyone enjoyed this Capsule dinner!

Lastly, I can't help but admire Hinton's naming. He renamed neural networks "Deep Learning," and it caught fire. Now, he's put clustering into neural networks and called it "Dynamic Routing." I wonder if it will repeat the glory of the deep learning revolution? (Laughs, then vanishes!)