By 苏剑林 | November 25, 2016
I must say, Question B of the 2013 National Mathematical Modeling Contest is truly a once-in-a-century masterpiece among modeling problems: the statement is concise and clear, the content is rich, the possible approaches are diverse, and it extends well beyond the contest itself—to the point that I have remained obsessed with it. Because of this problem, I have already written two articles on Scientific Space: "A Person's Mathematical Modeling: Shredded Paper Restoration" and "Late by One Year Modeling: Re-exploring Shredded Paper Restoration". When I first worked on it, I knew only a little about mathematical modeling. Ever since I started learning about data mining, and deep learning in particular, I have wanted to redo this problem, but I kept putting it off. These past few days I finally got it done.
If readers are unfamiliar with the problem, they can refer to the two previous articles. The shredded-paper-restoration problem comes with five attachments, representing five kinds of shredded fragments, i.e., fragments of five different granularities. Attachments 1 and 2 are not difficult; the difficulty is concentrated in Attachments 3, 4, and 5, which are roughly equally hard to handle. The most natural approach is a greedy algorithm: pick any fragment, find the fragment that matches it best, and keep matching the next one in the same way. For the greedy algorithm to work, the key is a good distance function for judging whether two fragments are adjacent (horizontally adjacent; vertical adjacency is not considered here).
The previous two articles used the Euclidean distance between edge vectors and mentioned indicators such as correlation coefficients. However, applied to Attachments 3, 4, and 5 these work poorly: the fragments there are small, so the edges alone carry little information. Relying only on the edges is therefore not enough; one must also take several other factors into account, such as the line spacing, the average position of the text lines, and so on. How exactly should all of this be weighed? It is hard for a human to write a good function by hand. In that case, why not leave it to a model—just use a Convolutional Neural Network (CNN)!
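To make the idea concrete, here is a minimal sketch of how such a learned "distance" is used inside the greedy algorithm (the actual training and stitching code appears later in the post; model and the preprocessed fragment arrays are assumed to already exist):
# Sketch only: `model` is the adjacency classifier trained below;
# `left` and `right` are preprocessed fragment arrays of shape (72, 180).
import numpy as np

def adjacency_score(model, left, right):
    # stack the two fragments the same way the training samples are built,
    # and let the network output the probability that they are adjacent
    pair = np.vstack((left, right))
    return model.predict(np.array([pair]))[0, 0]
# The greedy algorithm then repeatedly picks the fragment with the highest score.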
Specifically, we can observe the properties of the fragments, which are roughly: each fragment is a 72×180-pixel grayscale image of black text on a white background, with a fixed font size and uniform line spacing.
Given these features, we can take a pile of text, typeset it according to the same specification, and construct a large number of adjacent and non-adjacent sample pairs with the same properties, then train a convolutional neural network to learn a "distance function" automatically. For anyone familiar with deep learning this is a very simple task. I simply grabbed some Chinese text and generated a batch of Chinese samples. The code is roughly as follows:
from PIL import Image, ImageFont, ImageDraw
import numpy as np
from scipy import misc
import pymongo
from tqdm import tqdm

# 1000 Chinese articles pulled from a local MongoDB collection
texts = list(pymongo.MongoClient().weixin.text_articles.find().limit(1000))
text = texts[0]['text']

line_words = 30    # characters per line
font_size = 44     # font size in pixels
nb_columns = line_words*font_size/72+1    # number of 72-pixel-wide fragments per line

def gen_img(text):
    """Print an article onto an image and cut it into 72x180-pixel fragments."""
    n = len(text) / line_words + 1
    size = (nb_columns*72, (n*font_size/180+1)*180)
    im = Image.new('L', size, 255)    # grayscale canvas, white background
    dr = ImageDraw.Draw(im)
    font = ImageFont.truetype('simhei.ttf', font_size)
    for i in range(n):
        dr.text((0, 70*i), text[line_words*i: line_words*(i+1)], font=font)
    im = np.array(im.getdata()).reshape((size[1], size[0]))
    r = []
    for j in range(size[1]/180):
        for i in range(size[0]/72):
            # transpose, scale to [0, 1] and invert: white text on black background
            r.append(1-im[j*180:(j+1)*180, i*72:(i+1)*72].T/255.0)
    return r

sample = []
for i in tqdm(iter(texts)):
    sample.extend(gen_img(i['text']))

np.save('sample.npy', sample)

# number of horizontally adjacent pairs (nb_columns-1 pairs per line of fragments)
nb_samples = len(sample) - len(sample)/nb_columns
def data(sample, batch_size):
    """Endlessly generate batches of (fragment pair, adjacency label)."""
    sample_shuffle_totally = sample[:]    # negatives: completely shuffled fragments
    sample_shuffle_in_line = sample[:]    # negatives: shuffled only within the same line
    while True:
        # reshuffle at every pass so that the negative samples keep changing
        np.random.shuffle(sample_shuffle_totally)
        for i in range(0, len(sample_shuffle_in_line), nb_columns):
            subsample = sample_shuffle_in_line[i: i+nb_columns]
            np.random.shuffle(subsample)
            sample_shuffle_in_line[i: i+nb_columns] = subsample
        x = []
        y = []
        for i in range(0, len(sample), nb_columns):
            # positive samples: truly adjacent fragments from the same line
            subsample_1 = sample[i: i+nb_columns]
            for j in range(0, nb_columns-1):
                x.append(np.vstack((subsample_1[j], subsample_1[j+1])))
                y.append([1])
            # negative samples: pairs taken from the fully shuffled copy
            subsample_2 = sample_shuffle_totally[i: i+nb_columns]
            for j in range(0, nb_columns-1):
                x.append(np.vstack((subsample_2[j], subsample_2[j+1])))
                y.append([0])
            # negative samples: pairs shuffled within a line (same spacing, different content)
            subsample_3 = sample_shuffle_in_line[i: i+nb_columns]
            for j in range(0, nb_columns-1):
                x.append(np.vstack((subsample_3[j], subsample_3[j+1])))
                y.append([0])
            if len(y) >= batch_size:
                yield np.array(x), np.array(y)
                x = []
                y = []
        # yield whatever is left over at the end of a pass
        if y:
            yield np.array(x), np.array(y)
            x = []
            y = []
The process is as follows: I took 1,000 articles, each several thousand characters long, printed each article onto an image in a uniform layout, and then cropped the images to obtain a batch of samples (sample). The data function that follows is an iterator used to generate positive and negative samples; loading everything into memory at once is infeasible, so it has to be done through an iterator. Even so, I was rather lazy about it, and it still consumed 18 GB of memory on my server. sample_shuffle_totally contains random negative samples obtained by completely shuffling sample; sample_shuffle_in_line is shuffled only within each line, producing negative samples in which the line spacing and line positions are the same but the content differs. I reshuffle at every pass as a form of data augmentation, which greatly increases the number of distinct negative samples that actually take part in training.
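As a quick sanity check (a sketch, assuming sample has been built as above), one can pull a single batch from the generator and inspect its shapes and label balance:
# Sketch: inspect one batch produced by the generator defined above.
gen = data(sample, batch_size=1024)
x_batch, y_batch = next(gen)
# each input is two 72x180 fragments stacked vertically, i.e. shape (144, 180);
# roughly one third of the labels should be 1 (one positive group per two negative groups)
print x_batch.shape, y_batch.shape, y_batch.mean()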
It should be noted that we are considering horizontal adjacency, but a matrix is read in row order, from top to bottom rather than left to right, so the image matrix must be transposed so that its first axis indexes image columns. In addition, the images are normalized: the original grayscale values in the 0–255 range are scaled to the 0–1 range to speed up convergence. Finally, the matrix is subtracted from 1, which is essentially a color inversion—"black text on a white background" becomes "white text on a black background." When training the network we want the input to contain plenty of zeros, which also helps convergence, and since white is 255 while black is 0, "white text on a black background" converges faster than "black text on a white background."
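For a single fragment file, this preprocessing looks roughly like the following (a sketch; the file name is hypothetical):
# Sketch: preprocess one fragment image the same way as the training data.
from scipy import misc
frag = misc.imread('000.bmp', flatten=True)    # hypothetical file name; grayscale, 180x72
frag = 1 - frag.T/255.0    # transpose, scale to [0, 1], and invert the colors
# frag.shape is now (72, 180): 72 image columns, each a vector of 180 pixels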
Then we use them to train a CNN. The process is entirely conventional: a stack of three convolution + pooling layers, followed by fully-connected layers ending in a sigmoid classifier—it's that simple~
Model Structure:
_________________________________________________________________
Layer (type)                     Output Shape          Param #
=================================================================
input_2 (InputLayer)             (None, 144, 180)      0
_________________________________________________________________
convolution1d_4 (Convolution1D)  (None, 143, 32)       11552
_________________________________________________________________
maxpooling1d_4 (MaxPooling1D)    (None, 71, 32)        0
_________________________________________________________________
convolution1d_5 (Convolution1D)  (None, 70, 32)        2080
_________________________________________________________________
maxpooling1d_5 (MaxPooling1D)    (None, 35, 32)        0
_________________________________________________________________
convolution1d_6 (Convolution1D)  (None, 34, 32)        2080
_________________________________________________________________
maxpooling1d_6 (MaxPooling1D)    (None, 17, 32)        0
_________________________________________________________________
flatten_2 (Flatten)              (None, 544)           0
_________________________________________________________________
dense_3 (Dense)                  (None, 32)            17440
_________________________________________________________________
dense_4 (Dense)                  (None, 1)             33
=================================================================
Total params: 33,185
_________________________________________________________________
Code:
from keras.layers import Input, Convolution1D, MaxPooling1D, Flatten, Dense
from keras.models import Model

# input: two fragments stacked vertically -> 144 columns, each a vector of 180 pixels
input = Input((144, 180))
cnn = Convolution1D(32, 2)(input)
cnn = MaxPooling1D(2)(cnn)
cnn = Convolution1D(32, 2)(cnn)
cnn = MaxPooling1D(2)(cnn)
cnn = Convolution1D(32, 2)(cnn)
cnn = MaxPooling1D(2)(cnn)
cnn = Flatten()(cnn)
dense = Dense(32, activation='relu')(cnn)
dense = Dense(1, activation='sigmoid')(dense)    # probability that the pair is adjacent

model = Model(input=input, output=dense)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# one pass over the data yields nb_samples positives and 2*nb_samples negatives
model.fit_generator(data(sample, batch_size=1024),
                    nb_epoch=100,
                    samples_per_epoch=nb_samples*3
                    )

model.save_weights('2013_suizhifuyuan_cnn.model')
Actually, 95% accuracy is reached after just 3 epochs; the nb_epoch=100 above was just an arbitrary choice~ If computing resources allow and you are not in a hurry, more epochs won't hurt. I eventually reached 97.7% accuracy.
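To double-check that number, one can evaluate the trained model on a freshly generated batch (a sketch; note this is not a strictly held-out set, since the positive pairs come from the same corpus):
# Sketch: measure loss and accuracy on one fresh batch from the generator.
x_check, y_check = next(data(sample, batch_size=1024))
print model.evaluate(x_check, y_check, verbose=0)    # [loss, accuracy]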
Now we can test the performance of the "distance function" we trained.
import glob

img_names = glob.glob(u'附件3/*')
images = {}
for i in img_names:
    # same preprocessing as the training data: grayscale, transpose, scale, invert
    images[i] = 1 - misc.imread(i, flatten=True).T/255.0

def find_most_similar(img, images):
    """Return the name of the fragment most likely to sit to the right of img."""
    imgs_ = np.array([np.vstack((images[img], images[i])) for i in images if i != img])
    img_names_ = [i for i in images if i != img]
    sims = model.predict(imgs_).reshape(-1)
    return img_names_[sims.argmax()]

img = img_names[14]    # start the greedy chain from one fragment
result = [img]
images_ = images.copy()
while len(images_) > 1:
    print len(images_)
    img_ = find_most_similar(img, images_)
    result.append(img_)
    del images_[img]
    img = img_

# undo the preprocessing and stitch the fragments horizontally in the chosen order
images_ = [images[i].T for i in result]
compose = (1 - np.hstack(images_))*255
misc.imsave('result.png', compose)
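The stitching code above assumes the same session as training; in a fresh session, the model definition given earlier has to be run again and the saved weights reloaded first (a small sketch):
# Sketch: rebuild the model (see the definition above), then reload the trained weights.
model.load_weights('2013_suizhifuyuan_cnn.model')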
For Attachment 3, the result of a single greedy stitching pass is shown below (zoom in to see the original image):
Attachment 3 Restoration Degree (CNN + Greedy Algorithm)
For comparison, here is the earlier result of a single stitching pass using Euclidean distance:
Attachment 3 Restoration Degree (Euclidean Distance + Greedy Algorithm)
It can be seen that the improvement is very significant. Although this model was trained on Chinese corpora, when applied directly to Attachment 4, whose fragments are English text, the results are also quite good:
Attachment 4 Restoration Degree (CNN + Greedy Algorithm)
Similarly, for comparison, the single-pass stitching result for Attachment 4 using Euclidean distance is:
Attachment 4 Restoration Degree (Euclidean Distance + Greedy Algorithm)
Thus, the model we obtained is indeed effective and generalizes well. Of course, the direct stitching of the English fragments is not quite as good; including English corpora in the training data should give better results.
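A minimal way to do that would be to render an English corpus through the same gen_img pipeline and extend sample before training. The sketch below assumes a hypothetical MongoDB collection of English articles; in practice, the font and the number of characters per line would need adjusting for Latin text:
# Sketch (hypothetical data source): mix English samples into the training set.
english_texts = list(pymongo.MongoClient().weixin.english_articles.find().limit(1000))
for i in tqdm(iter(english_texts)):
    sample.extend(gen_img(i['text']))    # gen_img would need a Latin font and an adjusted line width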
A good problem should have rich content and lasting value. This is already the third article I have written about the shredded paper restoration problem; perhaps there will be a fourth or a fifth, since each revisit digs a little deeper...
To reprint, please include the address of this article: https://kexue.fm/archives/4100