By 苏剑林 | December 14, 2016
For the latest results, please refer to: http://kexue.fm/archives/4503/
Some time ago, I was fortunate enough to obtain a batch of labeled Tencent CAPTCHA samples from a netizen (CAPTCHA sample URL: http://captcha.qq.com/getimage). So, I took some time to test a CAPTCHA recognition model.
[Image: a sample Tencent CAPTCHA]
Samples
This batch of CAPTCHAs is relatively simple: each one consists of four English letters. Both upper and lower case appear, but input is case-insensitive. The glyphs are distorted and overlap to some degree, so a traditional segmentation-based approach would likely be hard to pull off. The end-to-end approach instead feeds the whole CAPTCHA image through several convolutional layers and then into four parallel classifiers (each a 26-way classification) that directly output the four letter labels. There really isn't much more to say: as long as you have samples, you can do it. Moreover, this framework is generic; it works for case-sensitive scenarios (52-way classification) or for mixed letters and digits (just add 10 more classes).
However, there is one thing I find hard to handle: the labels are case-insensitive. In this batch of samples the labels are all lowercase, while the CAPTCHA images contain both uppercase and lowercase letters. With only a 26-way classification, we are forced to treat 'A' and 'a' as the same class, and since they look quite different, lumping them together seems to be asking rather a lot of the model... I suspect this is one reason my model's accuracy isn't higher, but I don't have a better idea yet.
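For concreteness, the label encoding under this constraint looks roughly like the following (a minimal sketch; the function and variable names are hypothetical, not taken from the repo):

import string

# map each of the 26 letters to a class index; both cases collapse to the same class
char_to_idx = {c: i for i, c in enumerate(string.ascii_lowercase)}

def encode_label(label):
    # e.g. 'AbCd' and 'abcd' both become [0, 1, 2, 3]
    return [char_to_idx[ch.lower()] for ch in label]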
Code
Without further ado, here is the code:
https://github.com/bojone/n2n-ocr-for-qqcaptcha
The model is very concise and conventional (it’s just a single file, how complex could it be?).
Basically, four convolutional layers extract image features, and these features are then fed into four separate softmax classifiers, each over 26 classes. Note that you cannot simply use TimeDistributed here for convenience: TimeDistributed shares weights, whereas here the same features must produce four different labels; if the weights were identical, wouldn't the four outputs be identical too? Also note that I am using Keras with the Theano backend, not TensorFlow; the two order image dimensions differently, so TensorFlow users will need to adjust accordingly.
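For reference, here is a minimal sketch of what such a model might look like in the Keras 1.x functional API with Theano channels-first ordering. This is not the repo's exact code: the activation choice, optimizer, and loss below are my assumptions; only the layer shapes follow the summary shown afterwards.

from keras.models import Model
from keras.layers import Input, Convolution2D, MaxPooling2D, Activation, Flatten, Dense

# input shape follows the summary below: (channels, rows, cols), assuming image_dim_ordering='th'
inp = Input(shape=(3, 129, 53))
x = Convolution2D(32, 3, 3)(inp)
x = MaxPooling2D((2, 2))(x)
x = Convolution2D(32, 3, 3)(x)
x = MaxPooling2D((2, 2))(x)
x = Activation('relu')(x)        # the activation type is an assumption
x = Convolution2D(32, 3, 3)(x)
x = MaxPooling2D((2, 2))(x)
x = Activation('relu')(x)
x = Convolution2D(32, 3, 3)(x)
x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)
x = Activation('relu')(x)

# four independent 26-way softmax heads, one per character position (no weight sharing)
outputs = [Dense(26, activation='softmax')(x) for _ in range(4)]

model = Model(input=inp, output=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

The summary of the actual model is shown below.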
__________________________________________________________________________________________
Layer (type)                      Output Shape          Param #     Connected to
==========================================================================================
input_15 (InputLayer)             (None, 3, 129, 53)    0
__________________________________________________________________________________________
convolution2d_40 (Convolution2D)  (None, 32, 127, 51)   896         input_15[0][0]
__________________________________________________________________________________________
maxpooling2d_48 (MaxPooling2D)    (None, 32, 63, 25)    0           convolution2d_40[0][0]
__________________________________________________________________________________________
convolution2d_41 (Convolution2D)  (None, 32, 61, 23)    9248        maxpooling2d_48[0][0]
__________________________________________________________________________________________
maxpooling2d_49 (MaxPooling2D)    (None, 32, 30, 11)    0           convolution2d_41[0][0]
__________________________________________________________________________________________
activation_37 (Activation)        (None, 32, 30, 11)    0           maxpooling2d_49[0][0]
__________________________________________________________________________________________
convolution2d_42 (Convolution2D)  (None, 32, 28, 9)     9248        activation_37[0][0]
__________________________________________________________________________________________
maxpooling2d_50 (MaxPooling2D)    (None, 32, 14, 4)     0           convolution2d_42[0][0]
__________________________________________________________________________________________
activation_38 (Activation)        (None, 32, 14, 4)     0           maxpooling2d_50[0][0]
__________________________________________________________________________________________
convolution2d_43 (Convolution2D)  (None, 32, 12, 2)     9248        activation_38[0][0]
__________________________________________________________________________________________
maxpooling2d_51 (MaxPooling2D)    (None, 32, 6, 1)      0           convolution2d_43[0][0]
__________________________________________________________________________________________
flatten_15 (Flatten)              (None, 192)            0          maxpooling2d_51[0][0]
__________________________________________________________________________________________
activation_39 (Activation)        (None, 192)            0          flatten_15[0][0]
__________________________________________________________________________________________
dense_63 (Dense)                  (None, 26)             5018       activation_39[0][0]
__________________________________________________________________________________________
dense_64 (Dense)                  (None, 26)             5018       activation_39[0][0]
__________________________________________________________________________________________
dense_65 (Dense)                  (None, 26)             5018       activation_39[0][0]
__________________________________________________________________________________________
dense_66 (Dense)                  (None, 26)             5018       activation_39[0][0]
==========================================================================================
Total params: 48,712
__________________________________________________________________________________________
After several dozen epochs of training, the model reached recognition accuracies of 0.89, 0.72, 0.73, and 0.87 for the 1st, 2nd, 3rd, and 4th characters, respectively. Assuming the errors at the four positions are roughly independent, the probability of getting all four correct should therefore be about:
$$0.89 \times 0.72 \times 0.73 \times 0.87 \approx 0.41$$
That is, the overall accuracy should be around 41%. In actual testing the model did slightly better, reaching an overall accuracy of 46%, i.e. nearly one in every two CAPTCHAs is recognized correctly, which should be usable in many situations. Of course, this figure is measured on this particular batch of samples; accuracy in the wild might be lower, but I estimate it should still be at least around 10%? ^_^
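For completeness, the per-character and whole-CAPTCHA accuracies can be computed along these lines (a sketch with hypothetical names X_test and Y_test; Y_test is assumed to be a list of four one-hot arrays, one per character position):

import numpy as np

# model.predict on a 4-output model returns a list of four (n, 26) softmax arrays
preds = model.predict(X_test)
pred_chars = [p.argmax(axis=1) for p in preds]
true_chars = [y.argmax(axis=1) for y in Y_test]

# accuracy at each character position
per_char_acc = [(p == t).mean() for p, t in zip(pred_chars, true_chars)]

# a CAPTCHA counts as correct only if all four positions match
whole_acc = np.all([p == t for p, t in zip(pred_chars, true_chars)], axis=0).mean()

print(per_char_acc, whole_acc)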
It is not convenient for me to make the training samples public, and I am not publishing the model weights directly either; if you need them, please contact me privately.
Afterword
According to the friend who sent me the samples, he is currently using an interface provided by someone else with an overall accuracy of over 95%. I was instantly filled with admiration and would very much like to learn from that person. However, that program has already been commercialized, so it's unlikely I'll get to see how it works. I expect they have focused specifically on Tencent CAPTCHA recognition for a long time, unlike me, who is "broad but not deep."
Readers with better modeling ideas are very welcome to share them. The current model has fewer than 50,000 parameters and may well be underfitting; I will try to tune it further when I have time.