Random Talk on Models | Models and Picking Mangoes

By 苏剑林 | July 15, 2015

Many people feel that terms like "Model," "Big Data," and "Machine Learning" sound lofty and mysterious. In fact, they are not much different from picking fruit in our daily lives. This article spends a few thousand words trying to teach everyone how to pick mangoes...

The Model Metaphor

Suppose I want to pick the most delicious mango from a batch. Since I cannot directly cut them open to taste them, I can only observe the mangoes. The observable quantities include color, surface aroma, size, etc. These are the pieces of information (features) we can collect.

There are many such examples in life, such as buying matches (perhaps young city dwellers have never seen matches?). How do you judge the quality of a box of matches? Must you strike every single one to see whether it lights? Obviously not; at most we can strike a few, because if we strike them all, the matches are no longer matches. Of course, we can also look at the matches' appearance and smell their scent; observations like these do no harm.

We might discover that large, yellow mangoes are very sweet, yet some smaller, less yellow ones are also sweet. So what weight (proportion) should features like color, aroma, and size each carry? If I can find these weights, then I have found a method for predicting whether a mango is delicious. The match example is the same: we strike a few, see which ones light, and then summarize a method for predicting whether the rest will light without actually striking them.

This is exactly what a model does. First, we find a batch of mangoes (samples) and record their features (color, aroma, size, etc.); then we have someone taste them and note which are delicious and which are not. From this batch of samples, the weights of color, aroma, and size can be summarized, and this summarization is done by the machine itself.

Once finished, we obtain a model for predicting whether a mango is delicious. It is a bit like a black box: in the future, we input data such as color, aroma, and size, and it outputs the probability that the mango is delicious.
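To make the black box concrete, here is a minimal sketch in Python using scikit-learn. The mango data, feature scales, and feature names are all invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training samples: [yellowness 0-10, aroma 0-10, size in cm].
X = np.array([
    [9, 8, 12],  # yellow, fragrant, large
    [8, 7, 10],
    [3, 2, 6],
    [2, 3, 7],
    [7, 6, 5],   # small, but yellow and fragrant
])
# Labels from actually tasting each mango: 1 = delicious, 0 = not.
y = np.array([1, 1, 0, 0, 1])

# The machine "summarizes the weights" of the features automatically.
model = LogisticRegression().fit(X, y)

# The black box in action: feed in a new mango's features,
# get back the probability that it is delicious.
new_mango = np.array([[8, 5, 9]])
print(model.predict_proba(new_mango)[0, 1])
```

The learned coefficients (`model.coef_`) are exactly the "weights" of color, aroma, and size that the metaphor describes.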

The Significance of Models

From the metaphor above, we can see that the most important significance of a model is that it solves two problems:

1. "One-size-fits-all"

"One-size-fits-all" is a practice that often causes us much grief. Like in middle school, when a teacher "sentences us to death" without asking questions—this simple and crude approach is a typical "one-size-fits-all." Such practices have a certain degree of accuracy, but not all problems can be solved this way. On the contrary, "one-size-fits-all" often mistakenly "cuts out" the "superior varieties"!

For example, suppose I want to find the high-performing students in a class. We naturally assume that academic performance is proportional to time spent studying, so we decide that "studies more than 5 hours a day" defines a high performer. This is a "one-size-fits-all" approach. However, there are clearly students who are naturally gifted or who study very efficiently; they spend only one hour a day and still get excellent grades. Such students would be "cut out" by our rule, and those being cut are exactly the superior varieties.

2. Automatic Learning

Let's return to the mango example. Relying on "years of experience," we could develop a method for judging delicious mangoes even without a model. People might say, "What's so great about your model? We can do just as well ourselves." But what if I don't want to eat mangoes now? What if I want oranges or grapes? How do we predict whether an orange or a grape is delicious? We can't wait many years to accumulate "years of experience" for every new fruit, can we? Beyond the time cost, it is also a waste of manpower.

Of course, others might have relevant experience with oranges and grapes, and we could ask them. But asking has a cost; just think of the various paid training activities everywhere.

Models solve this problem perfectly. They allow us to start from a batch of existing samples (whether they are mangoes, apples, or lychees) and automatically and mechanically "summarize" (this process is called learning) a set of judgment methods. Since the learning is done by machines, it saves us time and effort. We only need to brew a cup of tea and wait for the model results, then see if the results are good. This is much better than learning and summarizing ourselves and then judging our own learning effectiveness, right?

The Methodology of Modeling

To build a good model, there are generally the following steps:

1. Prepare Samples

Samples are the batch of "mangoes" we use for learning.

In fact, the process of building a model is very similar to the human learning process. If it were left to a human, we would definitely take a sample of mangoes, record their features like color, size, and aroma, then cut them all open to taste them to see which are sour or sweet, and finally summarize the patterns.

With a model, the machine takes over that final step, the summarization; the preparatory work before it still has to be done by us. We taste a batch of mangoes ourselves, record that batch's information, and feed it all to the model. The model then learns automatically, and once it has learned, it can be used to predict the taste of new mangoes.

Preparing samples means preparing both good and bad samples. In other words, you need to find a batch of delicious mangoes and record their features, and you also need to find a batch of bad mangoes and record theirs. Only when you feed all this information to the model can the model learn automatically. In this process, humans play the role of the recorder.
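As a sketch of what "playing the recorder" might look like in practice (the file name and field names here are hypothetical):

```python
import csv

# One record per tasted mango: the observed features plus the tasting label.
samples = [
    {"yellowness": 9, "aroma": 8, "size_cm": 12, "delicious": 1},  # a good mango
    {"yellowness": 3, "aroma": 2, "size_cm": 6, "delicious": 0},   # a bad mango
    # ... more rows, good and bad mangoes alike
]

# Save the batch so the model can learn from it later.
with open("mangoes.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["yellowness", "aroma", "size_cm", "delicious"])
    writer.writeheader()
    writer.writerows(samples)
```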

2. Prepare Features

Features are the variables related to the outcome we want to judge; they are what the model bases its predictions on.

Simply put, features are the things a mango's deliciousness is related to. If we feel that a mango's deliciousness is related to its size, color, and aroma, then "size," "color," and "aroma" are the model's features, provided, of course, that this information is quantified.
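"Quantified" might look like the following sketch, where the scales and mappings are invented for illustration:

```python
# Raw observations are words; a model needs numbers.
COLOR_SCALE = {"green": 0, "greenish-yellow": 1, "yellow": 2, "golden": 3}
AROMA_SCALE = {"none": 0, "faint": 1, "strong": 2}

def quantify(color, aroma, size_cm):
    """Turn one mango's raw observations into a numeric feature vector."""
    return [COLOR_SCALE[color], AROMA_SCALE[aroma], size_cm]

print(quantify("golden", "strong", 11))  # -> [3, 2, 11]
```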

There are good features and bad features. Good features help the model make correct predictions, while bad features are at best useless for prediction. For example, which tree the mango was picked from or which day of the week it was picked—these are likely not good features. That is to say, this information usually does not help us judge the deliciousness of the mango. (Note that this is "usually," not absolute. Perhaps mangoes picked from tree A really are better than those from tree B.)

Good features are crucial to a model. It can be said that finding good features (whether manually or automatically) is the most important part of modeling. A good data scientist should devote most of their energy in the modeling process to feature selection; yet many researchers today fall into the trap of spending the vast majority of their energy on the model itself (step 3 below).
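One simple automated check of whether a feature is pulling its weight is to fit a tree ensemble and inspect its feature importances. A sketch on synthetic data (both the features and the "deliciousness" rule are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
yellowness = rng.uniform(0, 10, n)
day_of_week = rng.integers(0, 7, n)  # a feature we suspect is useless

# Synthetic ground truth: deliciousness depends only on yellowness.
delicious = (yellowness + rng.normal(0, 1, n) > 5).astype(int)

X = np.column_stack([yellowness, day_of_week])
forest = RandomForestClassifier(random_state=0).fit(X, delicious)

# Expect yellowness to dominate and day_of_week to be near zero.
print(dict(zip(["yellowness", "day_of_week"], forest.feature_importances_)))
```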

3. Prepare the Model

Preparing the model actually means selecting a model—that is, deciding which model to use for learning. This is like how people have different learning methods and experiences; it's about choosing which method to use for learning.

In actual machine learning there are quite a few models. One common division is into linear and non-linear models: linear models include Logistic Regression, linear SVM, etc.; non-linear models include Random Forest, GBDT, Neural Networks, and so on. Regarding models, there are generally a few points to be clear about:

(1) The model is not the most important thing

In fact, the most important part of the modeling process is feature selection. If the right features are selected, the performance difference between models will not be large. Therefore, do not spend most of your energy on model selection.

(2) Prevent overfitting

Overfitting is a phenomenon that is relatively hard to detect. Generally speaking, it means the resulting model performs very well on in-sample tests but miserably in actual application. Common ways to prevent overfitting include setting a regularization coefficient (i.e., a penalty term) or setting a smaller depth (for decision-tree-based models).
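A sketch of how overfitting shows up and how a depth limit tames it, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, just to make the symptom visible.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize the training samples...
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(deep.score(X_train, y_train), deep.score(X_test, y_test))  # in-sample high, out-of-sample lower

# ...while a depth limit should narrow that gap.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(shallow.score(X_train, y_train), shallow.score(X_test, y_test))
```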

(3) Try to use linear models

Non-linear models such as GBDT generally perform better, but they are also more prone to overfitting. Therefore, if a non-linear model's performance is not significantly better than the linear model's, prefer the linear model, because such models are more stable. This philosophy aligns with Occam's Razor: "Entities should not be multiplied beyond necessity."
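One way to make that comparison honest is cross-validation; if the two scores below come out close, Occam's Razor says to take the linear model. A sketch, again on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Average accuracy over 5 folds for a linear and a non-linear model.
linear_score = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
gbdt_score = cross_val_score(GradientBoostingClassifier(), X, y, cv=5).mean()

print(f"linear: {linear_score:.3f}, gbdt: {gbdt_score:.3f}")
```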

Finally

Of course, whatever the situation, one thing must be emphasized: models are useful, but they are not omnipotent, nor are they the most important thing. Do not worship models to the point of losing your own initiative. A model can be a work of art, provided that you are an artist.