OCR Technology Exploration: 2. Background and Assumptions

By 苏剑林 | June 17, 2016

Research Background

Optical Character Recognition (OCR) is the process of converting text in images into computer-editable text. Researchers have studied the problem for a long time, producing many mature OCR technologies and products, such as Hanwang OCR, ABBYY FineReader, and Tesseract OCR. ABBYY FineReader deserves special mention: it is highly accurate (including for Chinese) and preserves most of the original layout, which makes it a very powerful commercial OCR package. However, apart from Tesseract OCR, most established OCR products are closed-source, and many are commercial; we can neither embed them in our own programs nor improve on them. The only open-source option is Google's Tesseract OCR, but its recognition performance is not particularly good, and its accuracy on Chinese is relatively low and needs further improvement. In short, for both academic research and practical applications, exploring and improving OCR technology remains necessary.

Our team divides the complete OCR system into four parts: "feature extraction," "text localization," "optical recognition," and "language model." We solve them step by step, ultimately producing a usable, integrated OCR system for printed text. As a first application, the system can recognize text in images on platforms such as e-commerce sites and WeChat in order to verify the authenticity of information.

Research Assumptions

In this article, we assume that the text portions of the images have the following characteristics:

The fonts to be recognized are relatively standard printed fonts, such as Songti, Heiti, Kaiti, and Xingshu;
The text contrasts clearly with the background;
The text is typeset horizontally (an assumption made during model design);
The strokes have a certain width and are not too thin;
The color of a single character is uniform or, at most, a gradient;
The text is formed by relatively dense strokes and usually exhibits some degree of connectivity.

As can be seen, these characteristics are common features of typical e-commerce promotional posters and similar media, making these assumptions quite reasonable.
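To make a few of these assumptions concrete, here is a minimal sketch of how contrast, stroke thickness, and connectivity might be checked on a grayscale array with NumPy and SciPy. It is not the method used in this series; the function names (`make_synthetic_image`, `check_assumptions`), the thresholds, and the synthetic test image are all hypothetical and chosen only for illustration. It also assumes dark text on a lighter background.

```python
import numpy as np
from scipy import ndimage


def make_synthetic_image():
    """Build a toy grayscale image: light background, two dark blobs."""
    img = np.full((20, 40), 200, dtype=np.uint8)  # background gray level 200
    img[5:15, 5:9] = 30     # first blob: 10 rows x 4 columns
    img[5:15, 20:30] = 30   # second blob: 10 rows x 10 columns
    return img


def check_assumptions(img, min_contrast=50.0, min_thickness=2):
    """Rough checks for contrast, stroke thickness, and connectivity.

    Assumes dark text on a lighter background. Both thresholds are
    hypothetical values chosen for illustration only.
    """
    dark = img < img.mean()          # crude two-class split at the mean
    if not dark.any():               # uniform image: no candidate text
        return {"contrast": 0.0, "n_components": 0,
                "min_thickness": 0, "ok": False}

    # Contrast: gap between mean background and mean foreground levels.
    contrast = float(img[~dark].mean()) - float(img[dark].mean())

    # Connectivity: group dark pixels into 4-connected components.
    labels, n_components = ndimage.label(dark)

    # Thickness proxy: the shorter side of each component's bounding box.
    thickness = [min(s[0].stop - s[0].start, s[1].stop - s[1].start)
                 for s in ndimage.find_objects(labels)]
    min_found = int(min(thickness))

    ok = bool(contrast >= min_contrast and min_found >= min_thickness)
    return {"contrast": contrast, "n_components": int(n_components),
            "min_thickness": min_found, "ok": ok}


if __name__ == "__main__":
    print(check_assumptions(make_synthetic_image()))
```

The bounding-box measure is only a crude stand-in for stroke width, and splitting at the mean gray level is a simplification; a real poster image would need a more careful binarization step before checks like these mean much.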

Analysis Flow

Figure 1: Our experimental flow chart

Experimental Platform

The experiments in this article were run on CentOS 7 with Python 2.7. The image processing portion used the NumPy, SciPy, Pandas, and Pillow libraries; the convolutional neural network model used Keras and Theano. Specific experimental configurations are discussed in later sections.