Conv1D-Based Spectrum Classification Model (1D Sequence Classification)

By 苏剑林 | May 02, 2018

Some time ago, Tianchi launched an astronomical data mining competition, LAMOST spectrum classification (assigning each spectrum to one of four classes). There was no prize money, but it looked interesting, so I signed up. I worked on it for a while and felt my results were decent. In the end, however, I forgot (or rather, never noticed) that there was one final step: submitting predictions on a new test set during the last two days of the preliminary round. And that was that; I was left with an unfinished entry. You could say I stumbled just before the finish line, and I had only myself to blame~

Astronomical Data Mining Contest—Intelligent Classification of Celestial Spectra

Later, when I compared notes with other contestants, it turned out my model had held up quite well. I remember the top score in the preliminary round was 0.83+, while mine was 0.82+, good for roughly 4th or 5th place. Moreover, many of the teams scoring above 0.8 reportedly used model ensembles, whereas my 0.82+ came from a single model. In group chats I have also noticed many people working on 1D sequence classification, and since spectrum classification is essentially a 1D sequence classification problem, I am sharing the model here in the hope that it will be useful to anyone interested.

Model

In fact, the model is nothing special: it is just standard 1D convolutions plus residual connections. For anyone who has worked with image models, it is a very familiar structure.

Spectrum Recognition CNN Model

Block Structure Diagram of the Spectrum CNN

The form of each Conv1D Block is shown in the diagram on the right. For specific parameters, please refer to the code below.

Principle

I will take this opportunity to explain why CNNs can be used here.

Features suitable for CNNs should satisfy local correlation, meaning that the association between features is mainly reflected locally and extends from the local to the whole. Obviously, images and audio satisfy this characteristic, which is why current mainstream image and audio recognition models use CNNs.

As for a spectrum sequence, it is essentially the sampled graph of a function (2600 points), as shown below, and it likewise exhibits local correlation, so a CNN is a natural fit. Using an RNN such as an LSTM is not impossible, but RNNs cannot be parallelized, which makes them painfully slow on sequences this long. To apply 1D convolutions, we only need to reshape each sequence into a $2600 \times 1$ array. Furthermore, humans recognize spectra mainly by their emission and absorption lines, which are, loosely speaking, features with large local differences (gradients). To pick out such features, all pooling layers use max pooling, and the activation function is ReLU throughout (note that $\text{relu}(x) = \max(x, 0)$).

Spectrum Reference Example
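
As a minimal sketch of the input format (the array below is just a random stand-in for a real flux sequence), the only preprocessing needed before feeding a spectrum to a Conv1D layer is adding batch and channel axes:

import numpy as np

spectrum = np.random.randn(2600)   # stand-in for one 2600-point flux sequence
x = spectrum[None, :, None]        # shape (1, 2600, 1), i.e. (batch, length, channels) for Conv1D

In the full code below the same reshaping is done with np.expand_dims, followed by dividing by 100, presumably to keep the values in a modest range.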

Additionally, because the sequences are fairly long (2600 points), a few tricks were used: dilated convolutions, stacking as many Blocks as possible, and a pooling layer after each Block to shrink the sequence.
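
To get a sense of scale (based on the Block and pooling configuration in the code below): each Block stacks three kernel-size-3 convolutions with dilation rates 1, 2 and 4, so a single Block already covers a window of $1 + 2(1 + 2 + 4) = 15$ points, and the seven MaxPooling1D(2) layers reduce the 2600-point input to roughly $2600 / 2^{7} \approx 20$ steps before the global pooling.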

Training

Before providing the training code, let me describe the training process.

The evaluation metric for the competition is the macro F1 score, which means calculating an F1 score for each class and then averaging the F1 values of the four classes. Generally, F1 scores are non-differentiable, so the macro F1 score is also non-differentiable and cannot be used directly as a loss function. However, we can write an approximation formula for the macro F1 score (actually, its negative):
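
$$\text{macro-F1} \approx \frac{1}{4}\sum_{i=1}^{4}\frac{2\sum_n y_{n,i}\,\hat{y}_{n,i}}{\sum_n\left(y_{n,i}+\hat{y}_{n,i}\right)},$$

where $y_{n,i}$ is the one-hot target and $\hat{y}_{n,i}$ is the predicted probability of class $i$ for sample $n$; the loss is the negative logarithm of this quantity. When the predictions are hard 0/1 labels, each term reduces to the usual per-class $2\mathrm{TP}/(2\mathrm{TP}+\mathrm{FP}+\mathrm{FN})$, and the factor $0.5$ in the code below is just the $2$ from the F1 formula combined with the $1/4$ from averaging over the four classes. In Keras: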

def score_loss(y_true, y_pred):
    loss = 0
    for i in np.eye(4):
        y_true_ = K.constant([list(i)]) * y_true
        y_pred_ = K.constant([list(i)]) * y_pred
        loss += 0.5 * K.sum(y_true_ * y_pred_) / K.sum(y_true_ + y_pred_ + K.epsilon())
    return - K.log(loss + K.epsilon())

In this function, y_true is the one-hot encoding of the target and y_pred is the vector of predicted class probabilities. If it looks opaque at first, write out the F1 formula for one-hot inputs and compare it with the expression above term by term; it should then become clear.

To optimize for this evaluation metric, I first trained the model to its best with the Adam optimizer (learning rate 0.001) and ordinary categorical cross-entropy, then switched the loss to score_loss and lowered the learning rate to 0.0001 to continue training. Both stages are reflected in the code. Training with the macro F1 loss from the very start did not converge, which is rather odd~

Code

Below is the reference code; indeed, there is nothing particularly special about it. If you have any questions, feel free to leave a comment for discussion.

#! -*- coding: utf-8 -*-

import numpy as np
import os,glob
import pandas as pd
import json
import keras.backend as K
from keras.callbacks import Callback
from keras.utils import to_categorical
from tqdm import tqdm
np.random.seed(2018)

# Data reading.
# Spectra are stored as one sample per txt file, containing the sequence in string format.
# For readers interested in other sequence classifications, simply know that this part generates sequence samples.
class Data_Reader:
    def __init__(self):
        self.labels = pd.read_csv('../first_train_index_20180131.csv')
        self.label2id = {'star':0, 'unknown':1, 'galaxy':2, 'qso':3}
        self.id2label = ['star', 'unknown', 'galaxy', 'qso']
        self.labels['type'] = self.labels['type'].apply(lambda s: self.label2id[s])
        self.labels = dict(zip(self.labels['id'], self.labels['type']))
        if os.path.exists('../train_data.json'): # Read saved data
            print('Found. Reading ...')
            self.train_data = json.load(open('../train_data.json'))
            self.valid_data = json.load(open('../valid_data.json'))
            self.nb_train = len(self.train_data)
            self.nb_valid = len(self.valid_data)
        else: # Build from original data if not saved
            print('Not Found. Preparing ...')
            self.data = glob.glob('../train/*.txt')
            np.random.shuffle(self.data)
            self.train_frac = 0.8
            self.nb_train = int(self.train_frac*len(self.data))
            self.nb_valid = len(self.data) - self.nb_train
            self.train_data = self.data[:self.nb_train]
            self.valid_data = self.data[self.nb_train:]
            json.dump(self.train_data, open('../train_data.json', 'w'))
            json.dump(self.valid_data, open('../valid_data.json', 'w'))
        self.input_dim = len(self.read_one(self.train_data[0]))
        self.fracs = [0.8, 0.1, 0.05, 0.05] # Sampling ratio for each class in each batch
        self.batch_size = 160 # batch_size
        for i in range(4):
            self.fracs[i] = int(self.fracs[i]*self.batch_size)
        self.fracs[0] = self.batch_size - np.sum(self.fracs[1:])
    def read_one(self, fn):
        return np.array(json.loads('[' + open(fn).read() + ']'))
    def for_train(self): # Generate training set
        train_data = [[], [], [], []]
        for n in self.train_data:
            train_data[self.labels[int(n.split('/')[2].split('.')[0])]].append(n)
        print('Samples:', self.fracs)
        Y = np.array([0]*self.fracs[0] + [1]*self.fracs[1] + [2]*self.fracs[2] + [3]*self.fracs[3])
        Y = to_categorical(np.array(Y), 4)
        while True:
            X = []
            for i in range(4):
                for n in np.random.choice(train_data[i], self.fracs[i]):
                    X.append(self.read_one(n))
            X = np.expand_dims(np.array(X), 2)/100.
            yield X,Y
    def for_valid(self): # Generate validation set
        X,Y = [],[]
        for n in self.valid_data:
            X.append(self.read_one(n))
            Y.append(self.labels[int(n.split('/')[2].split('.')[0])])
            if len(X) == self.batch_size or n == self.valid_data[-1]:
                X = np.expand_dims(np.array(X), 2)/100.
                Y = to_categorical(np.array(Y), 4)
                yield X,Y
                X,Y = [],[]

D = Data_Reader()

from keras.layers import *
from keras.models import Model
import keras.backend as K

def BLOCK(seq, filters): # Define network Block
    cnn = Conv1D(filters*2, 3, padding='SAME', dilation_rate=1, activation='relu')(seq)
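    # each Lambda below splits the 2*filters output channels into two halves and adds them,
    # reducing the channel count back to `filters` (a simple way to merge paired feature maps)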
    cnn = Lambda(lambda x: x[:,:,:filters] + x[:,:,filters:])(cnn)
    cnn = Conv1D(filters*2, 3, padding='SAME', dilation_rate=2, activation='relu')(cnn)
    cnn = Lambda(lambda x: x[:,:,:filters] + x[:,:,filters:])(cnn)
    cnn = Conv1D(filters*2, 3, padding='SAME', dilation_rate=4, activation='relu')(cnn)
    cnn = Lambda(lambda x: x[:,:,:filters] + x[:,:,filters:])(cnn)
    if int(seq.shape[-1]) != filters:
        seq = Conv1D(filters, 1, padding='SAME')(seq)
    seq = add([seq, cnn])
    return seq

# Build model: standard CNN with residuals and pooling
input_tensor = Input(shape=(D.input_dim, 1))
seq = input_tensor

seq = BLOCK(seq, 16)
seq = MaxPooling1D(2)(seq)
seq = BLOCK(seq, 16)
seq = MaxPooling1D(2)(seq)
seq = BLOCK(seq, 32)
seq = MaxPooling1D(2)(seq)
seq = BLOCK(seq, 32)
seq = MaxPooling1D(2)(seq)
seq = BLOCK(seq, 64)
seq = MaxPooling1D(2)(seq)
seq = BLOCK(seq, 64)
seq = MaxPooling1D(2)(seq)
seq = BLOCK(seq, 128)
seq = MaxPooling1D(2)(seq)
seq = BLOCK(seq, 128)
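# Dropout with noise_shape=(batch, timesteps, 1): each time step is kept or dropped as a whole,
# with the mask broadcast across all channels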
seq = Dropout(0.5, (D.batch_size, int(seq.shape[1]), 1))(seq)
seq = GlobalMaxPooling1D()(seq)
seq = Dense(128, activation='relu')(seq)

output_tensor = Dense(4, activation='softmax')(seq)

model = Model(inputs=[input_tensor], outputs=[output_tensor])
model.summary()

# Define negative macro f1 score as loss
def score_loss(y_true, y_pred):
    loss = 0
    for i in np.eye(4):
        y_true_ = K.constant([list(i)]) * y_true
        y_pred_ = K.constant([list(i)]) * y_pred
        loss += 0.5 * K.sum(y_true_ * y_pred_) / K.sum(y_true_ + y_pred_ + K.epsilon())
    return - K.log(loss + K.epsilon())

# Define macro f1 score metric
def score_metric(y_true, y_pred):
    y_true = K.argmax(y_true)
    y_pred = K.argmax(y_pred)
    score = 0.
    for i in range(4):
        y_true_ = K.cast(K.equal(y_true, i), 'float32')
        y_pred_ = K.cast(K.equal(y_pred, i), 'float32')
        score += 0.5 * K.sum(y_true_ * y_pred_) / K.sum(y_true_ + y_pred_ + K.epsilon())
    return score

from keras.optimizers import Adam
model.compile(loss='categorical_crossentropy', # Cross-entropy as initial loss
              optimizer=Adam(1e-3),
              metrics=[score_metric])

try:
    model.load_weights('guangpu_highest.model')
except:
    pass

def predict():
    import time
    data = pd.read_csv('../first_test_index_20180131.csv')['id']
    data = '../test/' + data.apply(str) + '.txt'
    data = list(data)
    I,X,Y = [],[],[]
    for n in tqdm(iter(data)):
        X.append(D.read_one(n))
        I.append(int(n.split('/')[2].split('.')[0]))
        if len(X) == D.batch_size or n == data[-1]: # also flush the final, possibly partial, batch
            X = np.expand_dims(np.array(X), 2)/100.
            y = model.predict(X)
            y = y.argmax(axis=1)
            Y.extend(list(y))
            X = []
    d = pd.DataFrame(list(zip(I, Y)))
    d.columns = ['id', 'type']
    d['type'] = d['type'].apply(lambda s: D.id2label[s])
    d.to_csv('result_%s.csv'%(int(time.time())), index=None, header=None)

if __name__ == '__main__':

    # Define Callback to calculate validation score and save best model
    class Evaluate(Callback):
        def __init__(self):
            self.scores = []
            self.highest = 0.
        def on_epoch_end(self, epoch, logs=None):
            R,T = [],[]
            for x,y in D.for_valid():
                y_ = model.predict(x)
                R.extend(list(y.argmax(axis=1)))
                T.extend(list(y_.argmax(axis=1)))
            R,T = np.array(R),np.array(T)
            score = 0
            for i in range(4):
                R_ = (R == i)
                T_ = (T == i)
                score += 0.5 * (R_ * T_).sum() / (R_.sum() + T_.sum() + K.epsilon())
            self.scores.append(score)
            if score >= self.highest: # Save best weights
                self.highest = score
                model.save_weights('guangpu_highest.model')
                json.dump([self.scores, self.highest], open('valid.log', 'w'))
            print(' score: %f%%, highest: %f%%' % (score*100, self.highest*100))

    evaluator = Evaluate()

    # Stage 1 training
    history = model.fit_generator(D.for_train(),
                                  steps_per_epoch=500,
                                  epochs=50,
                                  callbacks=[evaluator])

    model.compile(loss=score_loss, # Switch to score_loss
                  optimizer=Adam(1e-4),
                  metrics=[score_metric])

    try:
        model.load_weights('guangpu_highest.model')
    except:
        pass

    # Stage 2 training
    history = model.fit_generator(D.for_train(),
                                  steps_per_epoch=500,
                                  epochs=50,
                                  callbacks=[evaluator])
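
    # Optional final step (not part of the original script): reload the best checkpoint
    # saved by the Evaluate callback and write out a submission with the predict() helper above
    model.load_weights('guangpu_highest.model')
    predict()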