From Boosting Learning to Neural Networks: Seeing Mountains as Mountains?

By 苏剑林 | July 01, 2016

A while ago, while teaching text mining to students at Hanshan Normal University in Chaozhou, I delved into Boosting learning algorithms and engaged in some brainstorming. Eventually, I clarified some essential characteristics of Boosting learning algorithms and obtained some unexpected results. For instance, some theoretical proofs of the AdaBoost algorithm can also be used to explain why neural network models are so powerful.

AdaBoost Algorithm

Boosting learning belongs to the category of ensemble models. Of course, rather than calling it an algorithm, it is better to describe it as a problem-solving strategy. Taking supervised classification as an example, it says that weak classifiers (any classifier whose accuracy is strictly greater than that of a random classifier) can be combined in a certain way to obtain an excellent classifier (in theory, the training accuracy can be pushed arbitrarily close to 100%).

The AdaBoost algorithm is a concrete example of a Boosting algorithm, proposed by Freund and Schapire in the mid-1990s. It gives an explicit construction for Boosting learning, together with a theoretical bound on the error rate.

Taking a binary classification problem as an example, let us assume we have a set of samples $\{x_i, y_i\}, i=1, 2, \dots, n$, where $x_i$ is the sample data (possibly multi-dimensional input) and $y_i \in \{1, -1\}$ is the sample label. We use 1 and -1 to describe the labels instead of the usual 1 and 0 merely for the convenience of the subsequent proof, without any special underlying meaning. Next, assume we already have a weak classifier $G(x)$, such as Logistic Regression, SVM, or decision trees. The only requirement for the classifier is that its accuracy must be strictly greater than random (in a binary classification problem, it must be strictly greater than 0.5). "Strictly greater" means there exists a constant $\epsilon > 0$ such that the accuracy each time is no lower than $\frac{1}{2} + \epsilon$.

The core idea of the AdaBoost algorithm is: before each round of training the weak classifier $G(x)$, it assigns different weights $w_{k,1}, w_{k,2}, \dots, w_{k,n}$ to the samples (increasing the weights of previously mispredicted samples). In this way, a different classifier $G_1(x), G_2(x), \dots, G_m(x)$ is obtained in each round (these could be the same model with different parameters, or a mixture of several different models). They are then combined in the following manner:

$$y = \bar{G}(x) = \text{sign}[f(x)] = \text{sign}\left[\sum_{k=1}^m \alpha_k G_k(x)\right]$$

Here $f(x) = \sum_{k=1}^m \alpha_k G_k(x)$. Ultimately, it can be proven that the combined classifier $\bar{G}(x)$ is an excellent classifier. Since the AdaBoost algorithm is not the primary focus of this article, I have placed the specific mathematical details at the end to avoid deviating from the main topic.
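To make the combination rule concrete, here is a minimal sketch in Python/NumPy (the names `weak_classifiers` and `alphas` are placeholders of my own, not anything defined above): each weak classifier returns $\pm 1$, and the ensemble takes the sign of their $\alpha$-weighted sum.

```python
import numpy as np

def ensemble_sign(x, weak_classifiers, alphas):
    """The combination rule G_bar(x) = sign[sum_k alpha_k * G_k(x)].

    weak_classifiers : list of callables, each mapping a sample x to +1 or -1
    alphas           : list of floats, the combination weights alpha_k
    """
    f = sum(a * G(x) for a, G in zip(alphas, weak_classifiers))
    return np.sign(f)
```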

Seeing Mountains as Mountains, Water as Water

Putting mathematical details aside, let us reflect: what exactly did the AdaBoost algorithm do? Or more broadly, what is the essence of so-called ensemble models?

When we build a classification model, the typical process is as follows:
1. Find a set of labeled data, such as sentiment-labeled comments for a text sentiment classification model;
2. Construct features;
3. Choose an appropriate model, such as Logistic Regression, SVM, Neural Networks, etc.;
4. Input the model for training;
5. Test results and optimize improvement.

Beginners often focus their energy on Step 3, becoming obsessed with high-accuracy non-linear models. In fact, models with very high accuracy often carry out a strong process of feature construction internally. If the features are constructed well, even simple linear models (like Logistic Regression) will achieve relatively high accuracy. In other words, the most important step in modeling should be the second step: the construction of features.

Of course, constructing good features is also the most difficult step. Constructing good features requires a thorough understanding of the data and sometimes professional background knowledge. There is no unified method for how to combine existing features to form better ones.

At this stage, we are often troubled by feature construction and look forward to surprises in prediction results. This is the stage of "seeing mountains as mountains, water as water" (seeing features as features, results as results).

Seeing Mountains Not as Mountains, Water Not as Water

However, why not be a bit more imaginative? Again, assume we have a binary classification task, such as text sentiment classification. We use existing data to train a model, say Logistic Regression, obtain the model's parameters, and output some prediction results. The accuracy of the results is higher than 50%.

Are these prediction results merely results?

From a mathematical perspective, these results are obtained through a linear combination of original data, followed by a non-linear transformation with a logistic function. Put simply, they are derived from the original features through certain calculations. Looking at this sentence alone—why not view this as a way of constructing features?

Exactly! It is a result, but it is also a feature! You use one set of weights to train a Logistic Regression model and get a batch of prediction results; you change the weights to train another Logistic Regression model with different parameters and get a new batch of prediction results; and so on. Why not treat these results as features and put them into yet another Logistic Regression model? This model, at the very least, will not be less accurate than any single original model, right? Thus, we arrive at a magical conclusion—models can be used to predict results, and they can also be used to construct features—results are features. At this point, we reach the realm of "seeing mountains not as mountains, water not as water"—the boundary between features and results has blurred.
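As a concrete illustration of this "results are features" idea, here is a minimal sketch using scikit-learn's `LogisticRegression` (the toy data and the random sample weights are my own illustrative assumptions; AdaBoost would instead choose the weights based on the previous rounds' errors):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy binary-classification data, purely for illustration.
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0).astype(int)

# Train several logistic regressions, each under a different sample weighting.
base_models = []
for k in range(5):
    w = rng.random(len(y))  # a different weight vector each round
    base_models.append(LogisticRegression().fit(X, y, sample_weight=w))

# Treat their outputs as new features: "results are features".
Z = np.column_stack([m.predict_proba(X)[:, 1] for m in base_models])

# A second-level logistic regression built on top of those outputs.
stacker = LogisticRegression().fit(Z, y)
print("base accuracies :", [round(m.score(X, y), 3) for m in base_models])
print("stacked accuracy:", round(stacker.score(Z, y), 3))
```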

At this point, we might have a sudden realization: the essence of Boosting learning is nothing more than treating model results as features and then building a model on top of them once more. Since model outputs are generally good features already, the final stage can significantly enhance the effectiveness of the classifier; when the features are good, accuracy naturally follows.

At this point, I really want to sigh and quote Laozi: "The Tao that can be told is not the eternal Tao; The name that can be named is not the eternal name."

Seeing Mountains Still as Mountains, Water Still as Water

This is a new perspective: viewing model results as a constructed feature. However, have we not used this method before?

In fact, we have, though it might not be obvious. Suppose there is a binary classification problem (e.g., predicting whether someone smokes), where one of the features is gender, with values Male/Female. How do we quantify this and feed it into the model? We can use 1 to represent male and 0 to represent female. Can this not be seen as constructing the following model?

$$G(x) = \left\{\begin{aligned}&1, \quad x = \text{Male} \\ &0, \quad x = \text{Female}\end{aligned}\right.$$

Clearly, it can be viewed as a model (the input is gender, the output is whether they smoke, where 1 indicates smoking) to predict whether an individual smokes. This was originally just a way of representing a feature, but now it has become a prediction result of a model, and we then input this model result into a new model for prediction. Reviewing this process, is it not also treating a result as a feature to be input into a model?

It turns out we used this idea long ago; it was just not obvious enough. Thinking back now, its shadow can be found in many aspects of data mining. Perhaps we have returned to simplicity and have the feeling of "seeing mountains still as mountains, water still as water." Boosting learning algorithms (AdaBoost) can be said to have carried this idea forward. And after reading the description of the neural network section below, this feeling becomes even stronger.

Neural Networks

However, the one that has truly taken this idea to the extreme is the neural network model.

Do neural network models also have a relationship with Boosting learning? Yes, a neural network can be seen as an enhanced version of a special case of AdaBoost. From this, we can also use the theoretical proofs of AdaBoost to explain why neural networks are so powerful.

What do I mean by this? Let's review the paragraph we just discussed:

It is a result, but it is also a feature! You use one set of weights to train a Logistic Regression model and get a batch of prediction results; you change the weights to train another Logistic Regression model with different parameters and get a new batch of prediction results; and so on. Why not treat these results as features and put them into yet another Logistic Regression model? This model, at the very least, will not be less accurate than any single original model, right?

Representing this with a diagram, what kind of process is it? As shown below:

[Figure: the AdaBoost learning process]

At first glance, isn't this just a three-layer network model?

Exactly, this is a three-layer neural network model! By choosing Logistic Regression as the weak classifier and combining them using the AdaBoost algorithm, the result is equivalent to a three-layer neural network model!

However, neural networks are even more sophisticated. The AdaBoost algorithm is trained step by step, much like a greedy algorithm, so the result is very likely only a local optimum. In contrast, a neural network leaves all of its parameters free and optimizes them jointly against a single loss function; as long as the optimization algorithm is good enough, a better solution can be found. From this perspective, neural networks are superior. Furthermore, neural networks can nest more layers on this basis (which can be viewed as using AdaBoost itself as a weak classifier and combining it several more times to obtain an even better classifier), improving the effect further. Therefore, in today's era of Deep Learning, the significance of the AdaBoost algorithm has diminished. What remains unchanged is that its core idea still holds very important heuristic value.
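To see the structural correspondence, here is a minimal NumPy sketch (the parameters `W`, `b`, `alpha` are arbitrary placeholders rather than trained values): each hidden unit plays the role of one logistic-regression weak classifier, and the output layer plays the role of the weighted AdaBoost vote. In a neural network, `W`, `b`, and `alpha` would all be trained jointly against one loss, whereas AdaBoost fixes them one round at a time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def three_layer_forward(x, W, b, alpha):
    """Forward pass mirroring 'AdaBoost over logistic regressions'.

    x     : input features, shape (d,)
    W, b  : parameters of m logistic-regression 'weak classifiers', shapes (m, d) and (m,)
    alpha : combination weights of the output layer, shape (m,)
    """
    hidden = sigmoid(W @ x + b)    # each hidden unit acts like one G_k(x)
    votes = 2.0 * hidden - 1.0     # map (0, 1) outputs onto the (-1, 1) range
    return np.sign(alpha @ votes)  # the AdaBoost-style combination sign[sum_k alpha_k G_k(x)]
```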

Not only that, from the perspective of "treating prediction results as features," we can also derive the structure of RNNs (Recurrent Neural Networks)! We find that the current AdaBoost algorithm, in the final step of model construction, only uses the output results of the previous models as features. Why not use both the previous output results and the original data together as features to build the model?

If we actually do this, we form the prototype of an RNN—adding the previous output to the current input.
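A minimal sketch of that prototype, assuming made-up shapes and parameter names: the previous output is fed back and combined with the current input at every step.

```python
import numpy as np

def rnn_step(x_t, y_prev, W_x, W_y, b):
    """One step of 'previous output + current input as features'.

    x_t    : current input, shape (d,)
    y_prev : previous step's output, shape (h,)
    W_x    : shape (h, d);  W_y : shape (h, h);  b : shape (h,)
    """
    return np.tanh(W_x @ x_t + W_y @ y_prev + b)

def rnn_forward(xs, W_x, W_y, b):
    """Unroll over a sequence: each output becomes part of the next step's features."""
    y = np.zeros(W_y.shape[0])
    outputs = []
    for x_t in xs:
        y = rnn_step(x_t, y, W_x, W_y, b)
        outputs.append(y)
    return outputs
```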

By this point, neural networks have become a special case of the AdaBoost algorithm, but simultaneously its enhanced version. One might say that neural networks have pushed the idea of ensemble models to their pinnacle.

At this moment, it is as if Boosting is everywhere, truly a case of "seeing mountains as mountains, water as water"! Because Boosting has never been just an independent model; it is a profound problem-solving philosophy—profound thoughts have far-reaching impacts~

Supplement: Derivations for the AdaBoost Algorithm

The AdaBoost algorithm is not hard to understand; the real question is how to choose the values of $w$ and $\alpha$. Freund and Schapire gave the following scheme:

1. At the beginning, initialize $w_{1,1} = w_{1,2} = \dots = w_{1,n} = \frac{1}{n}$, and use these weights to train the classifier $G_1(x)$;
2. In each subsequent step, update $w$ and $\alpha$ in the following way:

\begin{align} &\alpha_k = \frac{1}{2} \log \frac{1-\varepsilon_k}{\varepsilon_k} \\ &w_{k+1,i} = \frac{1}{Z_k} w_{k,i} \exp(-\alpha_k y_i G_k(x_i)) \end{align}

Where $\varepsilon_k$ is the weighted error rate of the model $G_k(x)$, i.e., the total weight of the mispredicted samples (with the uniform initial weights $w_{1,i} = \frac{1}{n}$, this is just the number of errors divided by $n$). Written as a mathematical formula, it is:

$$\varepsilon_k = \sum_{i=1}^n w_{k,i} I(-y_i G_k(x_i))$$

Here $I(x)$ is the indicator function:

$$I(x) = \left\{\begin{aligned}&1, \quad x > 0 \\ &0, \quad x \leq 0\end{aligned}\right.$$

When the prediction is correct, $-y_i G_k(x_i) = -1$, so $I(-y_i G_k(x_i)) = 0$. Conversely, $I(-y_i G_k(x_i)) = 1$. Thus, $\sum_{i=1}^n I(-y_i G_k(x_i))$ counts the number of errors. And $Z_k = \sum_{i=1}^n w_{k,i} \exp(-\alpha_k y_i G_k(x_i))$ is the normalization factor.
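Putting the updates together, here is a minimal sketch of the scheme above in Python (the choice of depth-1 decision trees from scikit-learn as the weak classifier, and the clipping of $\varepsilon_k$ to avoid division by zero, are my own illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, m=20):
    """Train m weak classifiers with the w / alpha updates described above.

    X : array of shape (n, d);  y : array of shape (n,) with labels in {-1, +1}.
    Returns the weak classifiers G_k and their combination weights alpha_k.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)                   # w_{1,i} = 1/n
    models, alphas = [], []
    for k in range(m):
        G = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = G.predict(X)
        eps = np.sum(w * (pred != y))         # weighted error rate epsilon_k
        eps = np.clip(eps, 1e-12, 1 - 1e-12)  # numerical guard
        alpha = 0.5 * np.log((1 - eps) / eps)
        w = w * np.exp(-alpha * y * pred)     # un-normalized w_{k+1,i}
        w = w / w.sum()                       # divide by Z_k
        models.append(G)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(X, models, alphas):
    f = sum(a * G.predict(X) for a, G in zip(alphas, models))
    return np.sign(f)                         # G_bar(x) = sign[f(x)]
```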

The biggest question is certainly: why choose them this way? I don't know how Freund and Schapire came up with it, but perhaps we can find some clues from the error rate analysis. The error rate $\varepsilon$ of the final classifier $\bar{G}(x)$ can be bounded as follows:

\begin{align} \varepsilon &= \frac{1}{n} \sum_{i=1}^n I(-y_i \bar{G}(x_i)) \\ &= \frac{1}{n} \sum_{i=1}^n I(-y_i f(x_i)) \\ &< \frac{1}{n} \sum_{i=1}^n \exp(-y_i f(x_i)) \end{align}

Here we use the fact that $I(x) < e^x$ for every $x$: when $x \leq 0$ we have $I(x) = 0 < e^x$, and when $x > 0$ we have $I(x) = 1 < e^x$. At this point, we can substitute the expression for $f$:

\begin{align} &\frac{1}{n} \sum_{i=1}^n \exp(-y_i f(x_i)) \\ &= \frac{1}{n} \sum_{i=1}^n \exp\left(-y_i \sum_{k=1}^m \alpha_k G_k(x_i)\right) \\ &= \frac{1}{n} \sum_{i=1}^n \prod_{k=1}^m \exp(-y_i \alpha_k G_k(x_i)) \\ &= \sum_{i=1}^n w_{1,i} \prod_{k=1}^m \exp(-y_i \alpha_k G_k(x_i)) \\ &= \sum_{i=1}^n w_{1,i} \exp(-y_i \alpha_1 G_1(x_i)) \prod_{k=2}^m \exp(-y_i \alpha_k G_k(x_i)) \\ &= Z_1 \sum_{i=1}^n w_{2,i} \prod_{k=2}^m \exp(-y_i \alpha_k G_k(x_i)) \\ &= \dots \\ &= Z_1 Z_2 \dots Z_m \end{align}

Reviewing the expression for $Z_k$:

$$Z_k = \sum_{i=1}^n w_{k,i} \exp(-\alpha_k y_i G_k(x_i))$$

It is worth noting that while the term $\exp(-\alpha_k y_i G_k(x_i))$ looks complex, it actually has only two possibilities: $e^{-\alpha_k}$ if the prediction is correct, and $e^{\alpha_k}$ if the prediction is incorrect. Therefore:

\begin{align} Z_k &= \sum_{i=1}^n w_{k,i} \exp(-\alpha_k y_i G_k(x_i)) \\ &= \sum_{\text{correct}} w_{k,i} e^{-\alpha_k} + \sum_{\text{incorrect}} w_{k,i} e^{\alpha_k} \\ &= e^{-\alpha_k} \sum_{\text{correct}} w_{k,i} + e^{\alpha_k} \sum_{\text{incorrect}} w_{k,i} \\ &= e^{-\alpha_k}(1-\varepsilon_k) + e^{\alpha_k} \varepsilon_k \\ &= 2\sqrt{(1-\varepsilon_k)\varepsilon_k} \quad (\text{substituting } \alpha_k = \frac{1}{2} \log \frac{1-\varepsilon_k}{\varepsilon_k}) \\ &= \sqrt{1-4\gamma_k^2} \quad (\text{let } \gamma_k = \frac{1}{2}-\varepsilon_k) \\ &\leq \exp(-2\gamma_k^2) \end{align}

So:

$$\varepsilon < Z_1 Z_2 \cdots Z_m \leq \exp\left(-2 \sum_{k=1}^m \gamma_k^2\right)$$

If $\gamma_k$ has a positive lower bound (this is exactly why the weak classifier's accuracy must be strictly greater than random), then the training error rate tends toward 0, decreasing exponentially fast. This is very ideal. Of course, even if such a model achieves 100% accuracy, that accuracy usually holds only on the training data; in other words, overfitting is likely to occur. However, this is an extreme case; in practice, AdaBoost's overfitting is not severe.
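As a quick numerical sanity check of the chain $Z_k = 2\sqrt{(1-\varepsilon_k)\varepsilon_k} \leq \exp(-2\gamma_k^2)$ above (the error rates are chosen arbitrarily for illustration):

```python
import numpy as np

for eps in (0.1, 0.3, 0.45):
    alpha = 0.5 * np.log((1 - eps) / eps)
    gamma = 0.5 - eps
    Z = np.exp(-alpha) * (1 - eps) + np.exp(alpha) * eps  # definition of Z_k
    print(round(Z, 6),                                    # Z_k
          round(2 * np.sqrt((1 - eps) * eps), 6),         # 2*sqrt((1-eps)*eps), equal to Z_k
          round(np.exp(-2 * gamma ** 2), 6))              # the upper bound exp(-2*gamma^2)
```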

Regarding Boosting algorithms, I won't provide further examples. More content can be found at: http://www.52caml.com/head_first_ml/ml-chapter6-boosting-family/