Exploration of OCR Technology: 5. Text Segmentation

By 苏剑林 | June 24, 2016

After the previous step of obtaining single-line text areas, we can then determine how to segment a single line of text into individual characters. Since the model in the third step is built for individual characters, this step is also necessary.

Uniform Segmentation

Based on the assumption of square Chinese characters, the simplest segmentation method is actually uniform segmentation. This means cutting the single-line text into square images according to the height without any additional judgment. This approach can handle most single-line text, as shown in the top image below.

Uniform segmentation success

Uniform segmentation failure

Of course, the drawbacks of uniform segmentation are also obvious. Most Chinese characters are square, but most English letters and digits are not. Therefore, if mixed Chinese and English text occurs, uniform segmentation fails, as shown in the bottom image above.

Statistical Segmentation

As seen from Figure 15, after the preceding operations, characters are well-separated from one another. Thus, another relatively simple approach is to perform a vertical summation of the single-line text image; columns where the sum is 0 are the cut columns.

This statistical approach can effectively solve the problem of segmenting single-line text images with mixed Chinese and English. However, it also has certain drawbacks. The most obvious is that characters like "小" (small) or "的" (of) might be split into multiple parts.

Sequential Comparison

A better approach is to combine the results of the two previous ideas, determining whether to cut by comparing whether the preceding and succeeding areas form a square. The specific steps are:

1. Use the statistical summation approach to find candidate cut lines;

2. If the distance from a candidate cut line to the left and right candidate cut lines exceeds 1.2 times the ~~(width)~~ height length, then that candidate cut line is confirmed as a cut line;

3. If the resulting area is a distinct long horizontal rectangle and cannot be segmented according to the above two steps, then perform uniform segmentation.

These three steps are relatively simple and based on two assumptions: 1. The width-to-height ratio of digits and English characters is greater than 60%; 2. The width-to-height ratio of Chinese characters is less than 1.2. After testing, this algorithm can be well-applied to the segmentation of text features extracted in the previous steps.