By 苏剑林 | June 25, 2016
Through the first and second steps, we have located the regions of the individual characters in the image. Next, we can build a model to recognize the single characters.
Model Selection
For the model, we chose a Convolutional Neural Network (CNN) from deep learning, building a recognition model for single characters out of multiple convolutional layers. Convolutional neural networks are a type of artificial neural network that has become the mainstream model in image recognition. They reduce the complexity of the network and the number of weights through local receptive fields and weight sharing, and their structure is closer to that of biological neural networks, which suggests they should perform particularly well. In fact, our main reasons for choosing a convolutional neural network are:
- Automatic feature extraction from raw images: a convolutional neural network can take the raw image directly as input, eliminating the manual feature engineering that is the hardest core step of traditional models;
- Higher accuracy than traditional models: on the MNIST handwritten-digit recognition task, for example, it achieves over 99% accuracy, well above what traditional models reach;
- Better generalization than traditional models: deformations of the image itself (scaling, rotation) and noise on the image have little effect on the recognition result, which is exactly what a good OCR system requires.
Training Data
To train a good model, there must be a sufficient amount of training data. Although no ready-made dataset is available, since we only need to recognize printed fonts, we can fortunately use a computer to generate a batch of training data automatically. Through the following steps, we constructed a reasonably sufficient training set:
- More detail: since the structure of Chinese characters is more complex than that of digits and English letters, to capture more detail we used $48\times 48$ grayscale images as the model's input samples;
- Commonly used Chinese characters: to ensure the model's practicality, we crawled hundreds of thousands of articles from the WeChat public platform, counted character frequencies across them, and selected the 3,000 most frequent Chinese characters (this article only considers simplified characters). Adding the 52 uppercase and lowercase letters and 10 digits gives a total of 3,062 characters as the model's output classes;
- Sufficient data: we manually collected 45 different fonts, from standard Songti, Heiti, and Kaiti to non-standard handwriting styles, basically covering the various printed fonts;
- Artificial noise: each font was rendered at 5 font sizes (46 to 50), with 2 images per size, and to strengthen the model's generalization ability, 5% random noise was added to each sample.
Through the above steps, we generated a total of $3062\times 45\times 5\times 2=1377900$ samples as training samples, which shows that the amount of data is sufficient.
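As a concrete illustration, the sketch below generates noisy $48\times 48$ samples with Pillow. It is only a minimal sketch under assumptions: the font file and character are placeholders, and "5% random noise" is interpreted here as flipping 5% of the pixels, since the article does not specify the noise model or publish its 45 fonts.

```python
# -*- coding: utf-8 -*-
# Minimal sketch of the sample generation described above (needs numpy and
# Pillow >= 8 for textbbox). The font path, the character, and the
# pixel-flip noise model are illustrative assumptions.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_char(ch, font_path, font_size, noise=0.05, canvas=48):
    """Render one character as a 48x48 grayscale array with random noise."""
    img = Image.new('L', (canvas, canvas), color=255)   # white background
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, font_size)
    # roughly center the glyph on the canvas
    left, top, right, bottom = draw.textbbox((0, 0), ch, font=font)
    w, h = right - left, bottom - top
    draw.text(((canvas - w) // 2 - left, (canvas - h) // 2 - top),
              ch, fill=0, font=font)
    x = np.asarray(img, dtype=np.float32) / 255.0
    # flip a random fraction of the pixels (5%, as in the article)
    mask = np.random.rand(canvas, canvas) < noise
    x[mask] = 1.0 - x[mask]
    return x

# e.g. two noisy samples of 日 at each size from 46 to 50, for one font
samples = [render_char(u'日', 'simsun.ttf', size)
           for size in range(46, 51) for _ in range(2)]
```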
Model Structure
Regarding the model structure, there is prior work to draw on. A similar task is MNIST handwritten-digit recognition, which often serves as a "touchstone" for new image-recognition models: recognizing over 60,000 handwritten digit images of $28\times 28$ pixels. That task shares certain similarities with the Chinese character recognition system we are building, so its model structures can be borrowed. A common convolutional network structure for MNIST is shown in Figure 17.
Figure 17: A network structure used for MNIST handwritten digit recognition
Figure 18: The network structure used in this article to recognize printed Chinese characters
After sufficient training, a network like the one in Figure 17 can reach over 99% accuracy, showing that the structure is indeed viable. However, there are only 10 handwritten digits, while common Chinese characters number in the thousands; the classification task in this article has 3,062 target classes. In other words, Chinese characters have a more complex and finer structure, so the model needs to be adjusted in several respects. First, for the model's input, we enlarged the images from $28\times 28$ to $48\times 48$, which preserves more detail. Second, the model's capacity needs to grow accordingly, including increasing the number of convolution kernels, increasing the number of hidden nodes, and adjusting the weights. Our final network structure is shown in Figure 18. For the activation function, we chose the ReLU function:
\begin{equation} \text{ReLU}(x)=\left\{\begin{aligned}&x,\quad x>0\\ &0,\quad x\leq 0\end{aligned}\right.\tag{13} \end{equation}
Experiments show that, compared with traditional activation functions such as sigmoid and tanh, ReLU greatly improves the model's performance [3][4]. To prevent overfitting, we used Dropout, the method most commonly employed in deep networks [5]: some neurons are randomly deactivated during training, which is equivalent to training many different networks simultaneously and thus suppresses the overfitting that individual nodes might develop. It should be pointed out that we did a great deal of screening over the model structure. For the number of neurons in the hidden layer, for example, we spent several days trying 512, 1024, 2048, 4096, and 8192, finally settling on 1024 as the most suitable value. Too many nodes make the model too large and prone to overfitting; too few lead to underfitting and poor results. Our tests found a clear improvement in accuracy from 512 to 1024 nodes, whereas adding further nodes brought no significant gain and sometimes even a significant drop.
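For concreteness, here is a minimal Keras sketch of this kind of network. Only the $48\times 48$ input, the ReLU activations, the 1024-node hidden layer, Dropout, and the 3,062-way softmax output come from the text; Figure 18's exact depth, kernel counts, and dropout rate are not spelled out there, so those values below are illustrative.

```python
# Minimal Keras sketch of the network described above. Kernel counts,
# depth, and the dropout rate are illustrative assumptions; the 48x48
# input, ReLU, 1024 hidden nodes, and 3,062-way softmax come from the text.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(48, 48, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),  # more kernels than the MNIST net
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(1024, activation='relu'),         # hidden-layer size found by search
    Dropout(0.5),                           # randomly deactivate neurons
    Dense(3062, activation='softmax'),      # 3,000 chars + 52 letters + 10 digits
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```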
Model Implementation
Our model was built on a server running CentOS 7 (24-core CPU + 96 GB RAM + GTX 960 GPU), written in Python 2.7, using Keras as the deep-learning library and Theano as the GPU backend (TensorFlow kept reporting memory overflows and we could not configure it successfully). For training, we used the Adam optimizer with a batch size of 1024 for 30 epochs; each epoch took roughly 700 seconds.
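As a side note, a Keras-on-Theano setup like this is usually selected through environment variables before Keras is first imported; the sketch below shows one common way to do it (the `device=gpu` flag applies to the older Theano releases matching this Python 2.7 setup; newer Theano uses `device=cuda`).

```python
# Select the Theano backend and the GPU for Keras; these variables must
# be set before keras is imported. Flags shown are for older Theano.
import os
os.environ['KERAS_BACKEND'] = 'theano'
os.environ['THEANO_FLAGS'] = 'device=gpu,floatX=float32'
import keras  # Keras now runs on Theano with GPU acceleration
```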
When visually similar characters occur, the higher-frequency character should be preferred. The most typical example is 日 (sun) and 曰 (to say): their features are very similar, but 日 occurs far more often than 曰, so 日 should be favored. Accordingly, when training the model we also adjusted the final loss function to give higher weight to high-frequency characters, which improves the model's predictive performance. After several rounds of tuning, we finally obtained a fairly reliable model; its convergence is shown below.
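The article does not state exactly how the frequency weighting entered the loss function. One simple way to approximate it in Keras, continuing the model sketch above, is the `class_weight` argument of `fit`; the log-frequency heuristic and the placeholder arrays below are assumptions.

```python
# One possible frequency weighting via Keras's class_weight. `counts` is a
# placeholder for the WeChat-corpus character frequencies, and x_train /
# y_train stand for the generated samples and their one-hot labels.
import numpy as np

counts = np.random.randint(1, 100000, size=3062).astype('float64')  # placeholder
weights = np.log(1.0 + counts)
weights /= weights.mean()              # normalize to mean weight 1
class_weight = dict(enumerate(weights))

model.fit(x_train, y_train,
          batch_size=1024,             # batch size from the article
          epochs=30,                   # 30 epochs, ~700 s each on the GTX 960
          class_weight=class_weight)   # up-weight high-frequency characters
```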
Figure: Training curves of loss and accuracy (Acc)
Model Verification
We will verify the model from the following three aspects. Experimental results show that for single character recognition, our model is superior to Google's open-source OCR system, Tesseract.
Training Set Verification
The trained model's evaluation report on the training set is shown in Table 1.
Table 1: Summary of model training results
From Table 1, we can see that even on samples with added random noise, the model's accuracy is still 99.7%. We can therefore confidently say that, in terms of single-character recognition alone, our results have reached the state-of-the-art level; on formal fonts such as Heiti and Songti ("formal font" samples refer to training samples in Heiti, Songti, Kaiti, Microsoft YaHei, and Arial Unicode MS, the fonts commonly seen in printed materials), the accuracy is even higher!
Test Set Verification
We separately selected 5 fonts and generated a batch of test samples by the same method (30,620 images per font, 153,100 in total) to test the model, obtaining an overall test accuracy of 92.11%. The results for the five fonts are shown in Table 2.
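For reference, evaluating one such per-font batch in Keras looks like the sketch below, with `x_test`/`y_test` standing in for the 30,620 generated images of a single font and their one-hot labels.

```python
# Evaluate one font's test batch; x_test / y_test are placeholders.
loss, acc = model.evaluate(x_test, y_test, batch_size=1024)
print('test accuracy: %.2f%%' % (100 * acc))
```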
Table 2: Model results in the test set (5% random noise)
From the table, we can see that the model performs quite well even on samples outside the training set. Next, we increased the random noise to 15% (quite severe for a $48\times 48$ character image); the results are shown in Table 3.
Table 3: Model results in the test set (15% random noise)
The average accuracy was 87.59%, showing that the noise has limited impact: the model still maintains an accuracy of around 90%. This indicates that the model has fully reached a practical level.