A Brief Exploration of OCR Technology: 1. Overview

By 苏剑林 | June 17, 2016

Preface: As mentioned in previous blog posts, last month I participated in the 4th Teddy Cup Data Mining Competition. I worked on Problem A, which is related to OCR systems, and I promised to open-source the final results. I've been busy with graduation and moving recently, so I haven't had time to organize this content until now.

I am sharing these results not because they are groundbreaking or advanced. On the contrary, after comparing them with Baidu's paper "Progress in Image Recognition Based on Deep Learning: Several Practices from Baidu", I realized that my approach still essentially follows the traditional framework and is far from the cutting edge. Rather, I am sharing them because, although OCR technology is relatively mature, few articles online explain in detail how an OCR system is implemented. This article aims to fill that gap. I have always believed that for technology to advance, it must be open-sourced (though this is debatable in China, as open-sourcing can easily lead to copycats). Whether it's research in mathematics and physics or data mining, I usually publish most of my work on my blog to exchange ideas with everyone.

To get to the point: although the results aren't exceptional, we have implemented a relatively complete and functional OCR system. In other words, we have carried out essentially every step needed to build an OCR system; as for how well each step was executed, I would say "just acceptable." Some descriptions in the text may be exaggerated; I hope readers will exercise their own judgment.

Below is the abstract of our paper:

We designed a series of algorithms to complete tasks such as text feature extraction and text localization. We established a character recognition model based on Convolutional Neural Networks (CNN) and finally combined it with a statistical language model to improve performance, successfully constructing a complete OCR (Optical Character Recognition) system.

In terms of feature extraction, we abandoned the traditional "edge detection + erosion/dilation" method. Based on some basic assumptions, we obtained high-quality text features through steps like grayscale clustering, layer decomposition, and denoising. These text features can be used both for text localization in the second step and directly input into the model for recognition in the third step, eliminating the need for additional feature extraction work.
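To make "grayscale clustering" and "layer decomposition" more concrete, here is a minimal sketch rather than the method from our paper. It assumes that clusters are the peaks of a Gaussian-smoothed gray-level histogram and that each pixel belongs to the layer of its nearest peak. The function names, the bandwidth value, and the toy image are all hypothetical choices made for illustration.

```python
import math


def gray_histogram(pixels, levels=256):
    """Count how many pixels take each gray level."""
    hist = [0] * levels
    for row in pixels:
        for value in row:
            hist[value] += 1
    return hist


def smooth(hist, bandwidth=8.0):
    """Smooth the histogram with a truncated Gaussian kernel (a discrete KDE)."""
    radius = int(3 * bandwidth)
    kernel = [math.exp(-0.5 * (d / bandwidth) ** 2)
              for d in range(-radius, radius + 1)]
    n = len(hist)
    density = []
    for i in range(n):
        total = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < n:
                total += weight * hist[j]
        density.append(total)
    return density


def find_peaks(density):
    """Return the gray levels that are local maxima of the smoothed density."""
    peaks = []
    n = len(density)
    for i in range(n):
        left = density[i - 1] if i > 0 else -1.0
        right = density[i + 1] if i < n - 1 else -1.0
        if density[i] > 0 and density[i] > left and density[i] >= right:
            peaks.append(i)
    return peaks


def split_layers(pixels, peaks):
    """Build one binary mask per peak; a pixel joins the nearest peak's layer."""
    layers = [[[0] * len(row) for row in pixels] for _ in peaks]
    for y, row in enumerate(pixels):
        for x, value in enumerate(row):
            k = min(range(len(peaks)), key=lambda i: abs(peaks[i] - value))
            layers[k][y][x] = 1
    return layers


# Toy image: a dark "T" (gray level 20) on a light background (gray level 220).
image = [
    [220, 220, 220, 220, 220],
    [220,  20,  20,  20, 220],
    [220, 220,  20, 220, 220],
    [220, 220,  20, 220, 220],
    [220, 220, 220, 220, 220],
]
peaks = find_peaks(smooth(gray_histogram(image)))
layers = split_layers(image, peaks)
for row in layers[0]:
    print(row)
```

In this toy case the dark layer contains only the strokes of the "T", which is the kind of mask that could then be denoised and passed on to localization.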

In terms of text localization, we first integrated feature fragments through proximity search to obtain single-line text features, and then segmented the single-line text into individual characters using a forward-backward statistical method. Tests indicate that this approach segments mixed Chinese and English text effectively.
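As a rough illustration of the proximity-search step only, the sketch below merges fragment bounding boxes into text lines. It assumes that two fragments belong to the same line when they overlap vertically and their horizontal gap is at most `max_gap` pixels; this rule, the threshold, and the box format `(x0, y0, x1, y1)` are assumptions made for the example, not details taken from our paper. The character segmentation step is not covered here.

```python
def vertical_overlap(a, b):
    """Height of the shared vertical extent of two boxes (negative if none)."""
    return min(a[3], b[3]) - max(a[1], b[1])


def horizontal_gap(a, b):
    """Horizontal distance between two boxes (negative if they overlap)."""
    return max(a[0], b[0]) - min(a[2], b[2])


def are_neighbors(a, b, max_gap):
    return vertical_overlap(a, b) > 0 and horizontal_gap(a, b) <= max_gap


def merge_into_lines(boxes, max_gap=10):
    """Group neighboring boxes with union-find and return one box per group."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if are_neighbors(boxes[i], boxes[j], max_gap):
                parent[find(i)] = find(j)

    groups = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    lines = [
        (min(b[0] for b in g), min(b[1] for b in g),
         max(b[2] for b in g), max(b[3] for b in g))
        for g in groups.values()
    ]
    return sorted(lines, key=lambda b: (b[1], b[0]))


# Five hypothetical fragments: three on one text line, two on another.
fragments = [
    (0, 0, 10, 10), (14, 1, 24, 11), (28, 0, 38, 10),
    (0, 30, 10, 40), (13, 31, 23, 41),
]
lines = merge_into_lines(fragments)
print(lines)
```

Union-find is used so that fragments are chained together: the first and third boxes are too far apart to be direct neighbors, but both are close to the second, so all three end up in one line.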

In terms of optical recognition, we built a single-character recognition model based on a CNN. We generated 1.4 million training samples ourselves and obtained a robust model: the training accuracy was 99.7%, and the test accuracy was 92.1%. Even when image noise was increased to 15%, accuracy remained at approximately 90%.
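The abstract does not define how the noise level is measured. One plausible reading, used here purely as an assumption, is that each pixel of a binary character image is flipped independently with the given probability. The sketch below shows how such noisy test samples could be produced; `add_flip_noise` and `flipped_fraction` are hypothetical helper names, and the 48x48 bitmap is a stand-in for a rendered character.

```python
import random


def add_flip_noise(bitmap, rate, rng):
    """Return a copy of a binary bitmap with each pixel flipped with probability `rate`."""
    return [[1 - p if rng.random() < rate else p for p in row] for row in bitmap]


def flipped_fraction(original, noisy):
    """Fraction of pixels that differ between two bitmaps of the same shape."""
    total = sum(len(row) for row in original)
    changed = sum(p != q
                  for row_a, row_b in zip(original, noisy)
                  for p, q in zip(row_a, row_b))
    return changed / total


# Stand-in for a rendered character: a vertical bar on a 48x48 canvas.
clean = [[1 if 20 <= x < 28 else 0 for x in range(48)] for _ in range(48)]
noisy = add_flip_noise(clean, 0.15, random.Random(0))
print(round(flipped_fraction(clean, noisy), 3))
```

Feeding both `clean` and `noisy` versions of held-out samples to a trained model would give the kind of accuracy-versus-noise comparison described above.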

Finally, to further improve on the results of the preceding steps, we integrated a language model. We estimated the transition probability matrix of common Chinese characters from hundreds of thousands of WeChat texts and used the Viterbi algorithm, a dynamic programming method, to find the optimal combination of recognized characters.
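The idea can be sketched as follows, assuming a first-order (bigram) language model and assuming the recognizer returns a few candidate characters with probabilities for each position. The tiny corpus, the candidate probabilities, and the `floor` value for unseen character pairs are made up for illustration; they are not the data or settings we used.

```python
import math
from collections import Counter


def estimate_transitions(corpus):
    """Estimate P(next char | previous char) from adjacent character pairs."""
    pair_counts = Counter()
    prev_counts = Counter()
    for text in corpus:
        for a, b in zip(text, text[1:]):
            pair_counts[(a, b)] += 1
            prev_counts[a] += 1
    return {pair: n / prev_counts[pair[0]] for pair, n in pair_counts.items()}


def viterbi(candidates, transition, floor=1e-6):
    """Find the most probable character sequence.

    candidates: one dict {char: recognition probability > 0} per position.
    transition: dict {(prev_char, char): probability}; unseen pairs get `floor`.
    Scores are sums of log probabilities to avoid numerical underflow.
    """
    # best[c] = (log score of the best path ending in c, that path)
    best = {c: (math.log(p), [c]) for c, p in candidates[0].items()}
    for position in candidates[1:]:
        new_best = {}
        for c, p in position.items():
            score, path = max(
                ((s + math.log(transition.get((prev, c), floor)), pth)
                 for prev, (s, pth) in best.items()),
                key=lambda item: item[0],
            )
            new_best[c] = (score + math.log(p), path + [c])
        best = new_best
    score, path = max(best.values(), key=lambda item: item[0])
    return "".join(path), score


# Toy corpus and toy recognizer output (both hypothetical).
corpus = ["今天天气很好", "天气预报", "汽车"]
transition = estimate_transitions(corpus)
candidates = [
    {"天": 0.6, "夭": 0.4},
    {"汽": 0.55, "气": 0.45},
]

greedy = "".join(max(c, key=c.get) for c in candidates)
text, score = viterbi(candidates, transition)
print(greedy, text)
```

Taking the top candidate at each position independently yields "天汽", whereas the language model prefers "天气" because that pair appears in the corpus and "天汽" does not.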

Combining these four parts of work yields a complete OCR system. Testing shows that our system performs well in identifying printed characters and can serve as a text recognition tool for platforms like e-commerce and WeChat.

References

[1] Li Meng; Research on Text Detection Algorithm Based on Multi-scale Gabor Filter and BP Neural Network; Computer Software and Theory; 2007
[2] Kernel Density Estimation; https://zh.wikipedia.org/zh-cn/核密度估计; Wikipedia
[3] Xavier Glorot, Antoine Bordes, Yoshua Bengio; Deep Sparse Rectifier Neural Networks
[4] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton; ImageNet Classification with Deep Convolutional Neural Networks
[5] Dropout: A Simple Way to Prevent Neural Networks from Overfitting
[6] Wu Jun; "The Beauty of Mathematics" (Second Edition); Chapter 3
[7] Wu Jun; "The Beauty of Mathematics" (Second Edition); Chapter 26