Coming Back...

By 苏剑林 | May 15, 2016

The previous blog post was published on April 15th, so as of today the blog has gone exactly a month without an update. Even so, the traffic to Scientific Spaces has held up. Thank you all for your support of this space; BoJone sincerely apologizes for the long silence. Before resuming updates, please allow me to recount the "journal of events" of the past month.

During this "vanishing" month, my main focus was my graduation thesis and a data mining competition. First, the thesis: it was submitted on April 22nd, and the defense was on April 29th. Once the defense was over, my work on the thesis was done. It mainly covers applications of path integrals to describing random walks, partial differential equations, and stochastic differential equations. As an undergraduate thesis it could not be too obscure, so on the whole it is fairly easy to read and can serve as an introductory tutorial on path integrals. I will polish it a little and publish it in several parts on Scientific Spaces for everyone to critique and offer feedback.

Speaking of path integrals, I must mention the project of writing exercise solutions for Quantum Mechanics and Path Integrals. Regrettably, I have had essentially no time for the exercises over the past month, but I will keep working on them. For the parts already published, I invite interested readers to point out any issues. I remember that at the beginning of the year a friend asked me what my wish for the year was, and I casually replied, "I hope to finish the exercises of a book." That book, of course, is Quantum Mechanics and Path Integrals. I believe I should be able to complete it this year.

The second matter is the data mining competition. I had taken part in the two previous editions, so this is my third time, and my collaboration with the organizers has grown closer; I have even recorded instructional videos for them before. The main reason I entered again this year was my interest in Problem A, provided by JD.com: build an OCR (Optical Character Recognition) system, that is, one that turns the text in images into editable text. I had never dealt with image processing problems before and wanted to take on the challenge, and since an OCR system is quite valuable in its own right, I decided to dig in.

Building an OCR system from scratch for a single competition seemed like a tall order, but I truly wanted to challenge myself, and so began a series of late nights. For OCR, the hardest part is text localization: essentially, drawing boxes around the characters. Here I made many attempts, trying most of the mainstream methods before finally settling on an approach of my own, with reasonably satisfactory results overall. For recognition, I tried convolutional neural networks (CNNs) for the first time. To speed up training, I bought a GTX 960 graphics card specifically for GPU acceleration; sure enough, GPU acceleration is powerful, giving nearly a 20-fold speedup over the CPU. Along the way I gradually picked up some techniques for building deep learning networks and once again experienced the power of deep learning (the accuracy of Chinese character recognition reached as high as 99%!). Finally, for the language model, I estimated character transition probabilities from a massive collection of WeChat articles and used the Viterbi algorithm, a dynamic programming method, to decode the most likely character sequence from the recognizer's candidates. This noticeably improved the recognition accuracy.
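To make the language-model step concrete, here is a minimal sketch of that style of Viterbi decoding, assuming the recognizer proposes a few candidate characters with probabilities at each position and that bigram transition probabilities have been estimated from a corpus beforehand. The data structures and the unseen-bigram floor `default` are illustrative assumptions, not my actual competition code.

```python
import math

def viterbi(candidates, trans_prob, default=1e-8):
    """Decode the most likely character sequence.

    candidates: list of dicts, one per position, each mapping a candidate
        character to its recognition (emission) probability.
    trans_prob: dict mapping (prev_char, char) to a bigram transition
        probability estimated from a corpus; unseen pairs fall back to
        the small floor `default`.
    """
    # Log-score of the best path ending at each first-position candidate.
    best = {c: math.log(p) for c, p in candidates[0].items()}
    back = [{} for _ in candidates]  # back-pointers for path recovery
    for i in range(1, len(candidates)):
        new_best = {}
        for c, p in candidates[i].items():
            # Best predecessor under accumulated score plus transition score.
            prev = max(best, key=lambda q: best[q] + math.log(trans_prob.get((q, c), default)))
            new_best[c] = (best[prev]
                           + math.log(trans_prob.get((prev, c), default))
                           + math.log(p))
            back[i][c] = prev
        best = new_best
    # Trace back from the best final candidate.
    last = max(best, key=best.get)
    path = [last]
    for i in range(len(candidates) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return "".join(reversed(path))
```

For example, `viterbi([{"中": 0.4, "申": 0.6}, {"国": 0.7, "图": 0.3}], {("中", "国"): 0.2})` returns "中国" even though "申" has the higher per-character score: the bigram transition term lets a plausible character pair outweigh a single mis-recognized character, which is exactly how the language model corrects the recognizer.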

And so, what had seemed impossible happened: in about 20 days, starting from scratch, I completed a usable, complete OCR system. Today is the final deadline for submitting the competition paper. Although the final result is not perfect, I feel it is comparable to Google's open-source Tesseract OCR. After the competition ends, I will gradually organize and open-source the results, and I am also considering setting up an online interface for public API calls.

However, although I brought in convolutional neural networks, the basic logic remains very traditional: locate the text, segment it into single characters, feed each character into the recognition model, and improve the output with a language model (a structural sketch follows below). With deep learning approaches (CNN plus RNN), one can instead recognize entire lines of text directly, without manual segmentation; Baidu's OCR is said to work this way. Personally, though, I feel that training such a model is realistically achievable only for large enterprises like Baidu.
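For readers who want the shape of that traditional pipeline at a glance, here is a minimal structural sketch. The stage names (`locate`, `segment`, `recognize`, `correct`) are hypothetical placeholders for whatever implementations one plugs in; this shows the architecture of the approach, not the competition code itself.

```python
from typing import Any, Callable, List

Image = Any  # placeholder: whatever image representation the stages share

def make_ocr_pipeline(
    locate: Callable[[Image], List[Image]],    # full image -> text-line crops
    segment: Callable[[Image], List[Image]],   # text line -> single-character crops
    recognize: Callable[[Image], str],         # character crop -> best-guess character
    correct: Callable[[str], str],             # raw string -> language-model correction
) -> Callable[[Image], List[str]]:
    """Compose the four traditional OCR stages into a single callable."""
    def ocr(image: Image) -> List[str]:
        results = []
        for line in locate(image):                         # 1. text localization
            chars = [recognize(c) for c in segment(line)]  # 2./3. segment, then recognize
            results.append(correct("".join(chars)))        # 4. language-model correction
        return results
    return ocr
```

The point of composing the stages this way is that each one can be tuned or swapped out independently, which is exactly what an end-to-end CNN-RNN model gives up in exchange for avoiding explicit segmentation.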

At this point, I can finally relax a bit and slowly return to the blog. My major at Sun Yat-sen University (SYSU) is Pure Mathematics, so I need to brush up on my math properly.