A Baseline for Fashion MNIST (MobileNet 95%)

By 苏剑林 | August 27, 2017

First Taste

Yesterday, I briefly tried a GAN model on Fashion MNIST and found that it could work. Of course, that attempt didn't involve much technical skill; it was just a matter of changing the paths in the original script and running it. Today, I returned to the main task of Fashion MNIST itself: 10-class classification. I used Keras to try several models on it and eventually reached a test accuracy of around 94.5%. With random-flip data augmentation, I was able to reach 95%.

At first, I wrote several model variants by hand, but testing showed that the accuracy wasn't great. It seems that designing a custom model for this dataset is quite difficult, so I turned to existing architectures. When it comes to off-the-shelf CNN models, we usually think of VGG, ResNet, Inception, Xception, etc. However, these models were designed for the 1,000-class ImageNet classification problem. Using them on this entry-level dataset seems like overkill, and they are prone to overfitting. Then I remembered that Keras ships with a model called MobileNet. After checking the model weights, I found that the parameter count isn't large, but the capacity should be sufficient. Therefore, I chose MobileNet for the experiment.

Delving Deeper

I won't go into much detail about MobileNet; there are many articles online explaining it. Simply put, its philosophy is similar to Xception's: most standard convolutions are replaced with depthwise separable convolutions. A depthwise separable convolution is loosely analogous to the SVD decomposition of a matrix: what was once one large convolution kernel is split into two much smaller pieces, a per-channel (depthwise) spatial convolution followed by a $1 \times 1$ (pointwise) convolution that mixes channels. This yields fewer parameters and, in many cases, comparable or better performance. Newer work along the same lines includes ShuffleNet, but as there is no Keras version yet, I set it aside.
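To make the parameter saving concrete, here is a small sketch that counts the weights of a standard convolution and of a depthwise separable one. The function names and the layer size (a $3 \times 3$ kernel with 256 input and 256 output channels) are illustrative choices rather than anything taken from the MobileNet source, and biases are ignored.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    # Every output channel has its own kernel spanning all input channels.
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_params(kernel_size, in_channels, out_channels):
    # Depthwise step: one spatial kernel per input channel.
    depthwise = kernel_size * kernel_size * in_channels
    # Pointwise step: a 1x1 convolution that mixes channels.
    pointwise = in_channels * out_channels
    return depthwise + pointwise


standard = standard_conv_params(3, 256, 256)
separable = separable_conv_params(3, 256, 256)
ratio = standard / separable
print(standard, separable, round(ratio, 2))
```

For this layer size the separable version needs 67,840 weights instead of 589,824, roughly 8.7 times fewer.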

The experiment is simple: load the MobileNet model with the default ImageNet pre-trained weights (there is no guarantee that ImageNet weights help on this dataset, but in practice they speed up convergence and improve accuracy; it seems many visual features are generic). Then I attached a 10-class softmax classifier on top and left all weights unfrozen for training. It's worth noting that:

1. MobileNet was originally designed for $224 \times 224$ inputs, while Fashion MNIST images are only $28 \times 28$, a significant difference. Although feeding in the raw images doesn't necessarily throw an error, I scaled them up by a factor of two to $56 \times 56$ to avoid losing detail. They could be scaled larger, but since that brought no obvious improvement in accuracy, it would be a waste of computation.

2. MobileNet requires three-channel image input. To accommodate this, simply duplicate the grayscale image three times.
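The two preprocessing steps above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the helper names `upscale_nearest` and `gray_to_rgb` are made up here, the upscaling is nearest-neighbour pixel repetition rather than the interpolated resize that `misc.imresize` performs in the script below, and the input is random data standing in for real images. It may be useful as a reference point because `scipy.misc.imresize` is no longer available in recent SciPy releases.

```python
import numpy as np


def upscale_nearest(images, factor=2):
    """Upscale a batch of shape (N, H, W) by repeating each pixel `factor` times."""
    return images.repeat(factor, axis=1).repeat(factor, axis=2)


def gray_to_rgb(images):
    """Turn (N, H, W) grayscale images into (N, H, W, 3) by copying the channel."""
    return np.repeat(images[..., np.newaxis], 3, axis=-1)


# Random stand-in for four 28x28 grayscale images scaled to [0, 1].
rng = np.random.RandomState(0)
batch = rng.randint(0, 256, size=(4, 28, 28)).astype(np.float32) / 255.0

upscaled = upscale_nearest(batch)  # shape (4, 56, 56)
rgb = gray_to_rgb(upscaled)        # shape (4, 56, 56, 3)
print(upscaled.shape, rgb.shape)
```

In the script itself the channel duplication happens inside the model, via the `Lambda` layer, so the arrays passed to `fit` stay single-channel.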

The entire code is as follows:

import numpy as np
import mnist_reader
from tqdm import tqdm
from scipy import misc
import tensorflow as tf

np.random.seed(2017)
tf.set_random_seed(2017)

X_train, y_train = mnist_reader.load_mnist('../data/fashion', kind='train')
X_test, y_test = mnist_reader.load_mnist('../data/fashion', kind='t10k')

height,width = 56,56

from keras.applications.mobilenet import MobileNet
from keras.layers import Input,Dense,Dropout,Lambda
from keras.models import Model
from keras import backend as K

input_image = Input(shape=(height,width))
input_image_ = Lambda(lambda x: K.repeat_elements(K.expand_dims(x,3),3,3))(input_image)
base_model = MobileNet(input_tensor=input_image_, include_top=False, pooling='avg')
output = Dropout(0.5)(base_model.output)
predict = Dense(10, activation='softmax')(output)

model = Model(inputs=input_image, outputs=predict)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

X_train = X_train.reshape((-1,28,28))
X_train = np.array([misc.imresize(x, (height,width)).astype(float) for x in tqdm(iter(X_train))])/255.

X_test = X_test.reshape((-1,28,28))
X_test = np.array([misc.imresize(x, (height,width)).astype(float) for x in tqdm(iter(X_test))])/255.

model.fit(X_train, y_train, batch_size=64, epochs=50, validation_data=(X_test, y_test))

The code is very simple and clear, so I won't add comments.

After multiple runs, an accuracy of over 94.5% is generally reached within 20 epochs (even though we set a random seed, the results are still not perfectly reproducible because of cuDNN). Later epochs are unstable and show signs of overfitting.

Refining the Details

I feel that attaining an accuracy above 94.5% without data augmentation is satisfying. I then tested it with data augmentation. Thinking about it, there aren't many suitable data augmentation methods for this dataset; the only one I could think of was random horizontal flipping. Let's add it and see the results:

import numpy as np
import mnist_reader
from tqdm import tqdm
from scipy import misc
import tensorflow as tf

np.random.seed(2017)
tf.set_random_seed(2017)

X_train, y_train = mnist_reader.load_mnist('../data/fashion', kind='train')
X_test, y_test = mnist_reader.load_mnist('../data/fashion', kind='t10k')

height,width = 56,56

from keras.applications.mobilenet import MobileNet
from keras.layers import Input,Dense,Dropout,Lambda
from keras.models import Model
from keras import backend as K

input_image = Input(shape=(height,width))
input_image_ = Lambda(lambda x: K.repeat_elements(K.expand_dims(x,3),3,3))(input_image)
base_model = MobileNet(input_tensor=input_image_, include_top=False, pooling='avg')
output = Dropout(0.5)(base_model.output)
predict = Dense(10, activation='softmax')(output)

model = Model(inputs=input_image, outputs=predict)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

X_train = X_train.reshape((-1,28,28))
X_train = np.array([misc.imresize(x, (height,width)).astype(float) for x in tqdm(iter(X_train))])/255.

X_test = X_test.reshape((-1,28,28))
X_test = np.array([misc.imresize(x, (height,width)).astype(float) for x in tqdm(iter(X_test))])/255.

def random_reverse(x):
    if np.random.random() > 0.5:
        return x[:,::-1]
    else:
        return x

def data_generator(X,Y,batch_size=100):
    while True:
        idxs = np.random.permutation(len(X))
        X = X[idxs]
        Y = Y[idxs]
        p,q = [],[]
        for i in range(len(X)):
            p.append(random_reverse(X[i]))
            q.append(Y[i])
            if len(p) == batch_size:
                yield np.array(p),np.array(q)
                p,q = [],[]
        if p:
            yield np.array(p),np.array(q)
            p,q = [],[]

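# 600 steps x batch size 100 = 60,000 training images; 100 steps x 100 = 10,000 test images.
# Note: the same generator is reused for validation, so test images are shuffled and randomly flipped too.
model.fit_generator(data_generator(X_train,y_train), steps_per_epoch=600, epochs=50, validation_data=data_generator(X_test,y_test), validation_steps=100)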

Sure enough, data augmentation provides some benefit. I ran it twice: once reaching 95.04% and once reaching 94.91%. That is to say, an accuracy of around 95% can be reached within 50 epochs. Note that not every augmentation method helps; I tried adding random masking, but accuracy actually went down. Thus, data augmentation must be adapted to the dataset, and especially to the test set. In essence, while data augmentation is applied to the training set, at its core it introduces prior knowledge about the test set.
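As a side note, the per-image `random_reverse` above can also be written for a whole batch at once. The following is a minimal sketch, not the code used for the results reported here: the function name `random_flip_batch`, its `p` parameter, and the explicit `RandomState` argument are assumptions made for illustration.

```python
import numpy as np


def random_flip_batch(images, rng, p=0.5):
    """Flip each image of a batch (N, H, W) left-right with probability p.

    Returns the augmented copy and the boolean mask of flipped images.
    """
    flip = rng.random_sample(len(images)) < p
    out = images.copy()
    # Reverse the width axis only for the selected images.
    out[flip] = out[flip][:, :, ::-1]
    return out, flip


rng = np.random.RandomState(2017)
# Tiny 2x3 "images" with distinct values so that a flip is easy to see.
batch = np.arange(4 * 2 * 3, dtype=np.float32).reshape(4, 2, 3)
flipped, mask = random_flip_batch(batch, rng)
print(mask)
```

Returning the mask is only for inspection; a training generator would normally discard it and yield the augmented batch with its labels.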

A Long Road Ahead

It seems Fashion MNIST really is fairly hard. On MNIST you can easily get over 90% accuracy with a single Dense layer, which is not the case here, so Fashion MNIST is a more representative benchmark for CNN algorithms. Test accuracy on MNIST generally reaches above 99%; however, based on the results I've seen so far, the highest accuracy on Fashion MNIST is only around 96% (and that result came without released source code). There is still a long way to go to reach 99%, even on a dataset this small.

I wonder which model will be the first to reach 99% accuracy~