A Preliminary Exploration of OCR Technology: 8. Comprehensive Evaluation

By 苏剑林 | June 26, 2016

Data Verification

Although the model works well in a test environment, practice is the sole criterion for testing truth. In this section, we verify our model against JD.com's test data.

The quality of an OCR system is measured in two parts: (1) whether the text regions are successfully boxed, and (2) whether the boxed text is successfully recognized. We score each image to evaluate its recognition quality. The scoring rules are as follows:

For each character, 1 point is awarded if its boxed region matches the corresponding box in the box file provided with JD's detection samples, and an additional 1 point if the character is also recognized correctly. The score for an image is the total points divided by the total number of characters.

Under this rule, the maximum score for an image is 2 points and the minimum is 0; a score above 1 indicates fairly good recognition. Compared against JD's test data, our model's average score is approximately 0.84, which is passable but leaves room for improvement.
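To make the rule concrete, here is a minimal scoring sketch. The data layout (lists of `(box, char)` pairs) and the IoU-threshold criterion for deciding whether a predicted box matches a ground-truth box are assumptions for illustration; the actual matching criterion for JD's box files may differ.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def score_image(pred_chars, truth_chars, thresh=0.5):
    """Score one image: for each ground-truth character, +1 if its region
    was boxed correctly and +1 more if it was also recognized correctly;
    the result is total points divided by the number of characters."""
    points = 0
    for t_box, t_char in truth_chars:
        # characters our system placed at (roughly) the same location
        hits = [p_char for p_box, p_char in pred_chars
                if iou(p_box, t_box) >= thresh]
        if hits:
            points += 1                 # region boxed correctly
            if t_char in hits:
                points += 1             # character recognized correctly
    return points / len(truth_chars)    # lies in [0, 2]
```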

Model Overview

In this article, our goal was to build a complete OCR system, and after a series of efforts we have largely achieved it.

When designing the algorithms, we stayed close to our basic assumptions and started from the perspective of human visual recognition, hoping to reach the goal in as few steps as possible. This philosophy is fully embodied in the feature extraction and text localization sections. Out of the same preference for simplicity and for imitating the manual process, we chose a convolutional neural network (CNN) for optical character recognition, which achieved high accuracy. Finally, we combined the CNN with a language model and improved the results through a relatively straightforward dynamic-programming step.
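To illustrate that last step, here is a minimal Viterbi-style dynamic-programming sketch: the CNN supplies, for each character position, a handful of candidate characters with probabilities, and a bigram language model picks the most plausible sequence. The candidate format, the `bigram` log-probability table, and the `unk` fallback value are illustrative assumptions, not the exact form used in our system.

```python
import math

def lm_decode(candidates, bigram, unk=-12.0):
    """Choose the most plausible character sequence by dynamic programming.

    candidates: list over positions; each item is a dict
                {char: cnn_probability} of the CNN's top guesses.
    bigram:     dict {(prev_char, char): log transition probability};
                pairs missing from the table fall back to `unk`.
    """
    # best[c] = (log-score of the best path ending in c, that path)
    best = {c: (math.log(p), [c]) for c, p in candidates[0].items()}
    for cands in candidates[1:]:
        new_best = {}
        for c, p in cands.items():
            score, pth = max(
                (s + bigram.get((prev, c), unk) + math.log(p), pth)
                for prev, (s, pth) in best.items()
            )
            new_best[c] = (score, pth + [c])
        best = new_best
    return "".join(max(best.values())[1])
```

The cost is proportional to (number of positions) × (candidates per position)², which stays small as long as the CNN passes on only a few candidates for each character.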

Testing shows that our system recognizes printed text well and can serve as a text recognition tool for images from e-commerce platforms, WeChat, and elsewhere. A distinctive feature is that it takes the entire text image as input and still produces good results even when the resolution is not high.

Reflections on Results

A significant deficiency of the algorithms in this article is that they rely on many "empirical parameters", such as the parameter $h$ used in clustering, the density threshold that defines low-density regions, the number of convolution kernels in the CNN, and the number of hidden-layer nodes. Because we did not have enough labeled samples, these parameters could only be set from experience and a small number of examples. We look forward to having more labeled data with which to find their optimal values.
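If such labeled data became available, these parameters could be tuned directly against the per-image score defined earlier. A hypothetical sketch, reusing `score_image()` from above and a stand-in `run_ocr()` for the complete pipeline (neither the name nor its signature comes from our actual code):

```python
from itertools import product

def grid_search(labeled_images, h_values, density_thresholds):
    """Try every (h, density) pair and keep the one with the best
    average score; `run_ocr` is a hypothetical stand-in that runs
    the whole pipeline with the given parameters."""
    best_params, best_avg = None, -1.0
    for h, d in product(h_values, density_thresholds):
        scores = [score_image(run_ocr(img, h=h, density=d), truth)
                  for img, truth in labeled_images]
        avg = sum(scores) / len(scores)
        if avg > best_avg:
            best_params, best_avg = (h, d), avg
    return best_params, best_avg
```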

Furthermore, there are still many points worth improving in text-region detection. Although we removed most non-text regions in just a few steps, those steps are still not intuitive enough and badly need simplification. We believe a good model should achieve good results from simple assumptions and simple steps; a worthwhile task, therefore, is to simplify the assumptions and reduce the complexity of the workflow.

In addition, for text segmentation there is in fact no single automatic algorithm that can handle every situation, so this step leaves considerable room for improvement. According to the literature, a CNN+LSTM model can recognize whole lines of text directly, avoiding segmentation altogether, but it requires a massive number of training samples and powerful training machines, which is likely feasible only for large enterprises.
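For reference, here is a minimal sketch of such a CNN+LSTM line recognizer (the CRNN family in the literature), written in PyTorch with CTC loss. All of the sizes below (filter counts, pooling, hidden width) are illustrative assumptions, not any published model's exact configuration.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Sketch of a CNN+LSTM line recognizer trained with CTC.
    Input: grayscale text-line images of height 32 and arbitrary width.
    Output: per-timestep scores over the character set plus a CTC blank."""
    def __init__(self, n_classes, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                       # (B, 1, 32, W)
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),                         # (B, 32, 16, W/2)
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),                         # (B, 64, 8, W/4)
        )
        self.rnn = nn.LSTM(64 * 8, hidden,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes + 1)  # +1 for CTC blank

    def forward(self, x):              # x: (B, 1, 32, W)
        f = self.cnn(x)                # (B, 64, 8, W/4)
        f = f.permute(0, 3, 1, 2)      # (B, W/4, 64, 8): width as time axis
        f = f.flatten(2)               # (B, W/4, 512)
        out, _ = self.rnn(f)           # (B, W/4, 2*hidden)
        return self.fc(out)            # (B, W/4, n_classes + 1)

# CTC loss aligns per-timestep outputs with the unsegmented label string,
# which is exactly what removes the need for character segmentation:
# criterion = nn.CTCLoss(blank=n_classes)   # expects (T, B, C) log-probs
```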

Clearly, there is still a lot of work that requires more in-depth research.