By 苏剑林 | January 18, 2016
In the "About the Author" section of the sidebar on this site, one line reads "Kitchen Enthusiast." Even though I'm not particularly good at cooking, the kitchen is indeed one of my hobbies. Of course, I have many interests—mathematics, physics, astronomy, computer science, etc.—I love them all and want to learn everything, which often leads to being a "jack of all trades, master of none." As mentioned in previous articles, data mining is also a hobby of mine. What interesting results might emerge when my passion for data mining intersects with my passion for the kitchen?
I did exactly that: I wrote a simple crawler to scrape a batch of recipe data from the "Home Cooking" directory of Meishi China and performed some basic data analysis. (I would like to express my sincere thanks to Meishi China. I chose them because their data is quite standardized.) The data analysis was conducted on my company's high-performance servers, which made the process incredibly smooth.
In total, I collected 18,209 recipes, encompassing 9,700 types of ingredients (including main ingredients, side ingredients, and seasonings; some may be duplicates due to non-standard naming). Of course, compared to "Big Data" standards in many other fields, this amount of data is negligible. However, in the kitchen—a realm where big data is rarely applied—it can be considered relatively substantial.
The simplest thing to do is a statistical analysis of the ingredients. Can you guess what appears the most?
Without any deep research into culinary arts, readers can probably guess that the most frequent item is definitely salt! Salt is often called the "Leader of a Hundred Flavors"; very few dishes are made without adding salt. Next is cooking wine, followed by light soy sauce—both are seasonings or condiments. This also shows that Chinese cuisine is very particular about seasonings, with a vast array of ingredients used for flavoring. Among main ingredients, potatoes appear at 28th place, pork belly at 38th, and so on.
Ingredient           Count
Salt                 11200
Cooking wine         4601
Light soy sauce      4413
Ginger               3671
Scallion             2854
Chicken essence      2579
White sugar          2440
Sugar                2303
Oil                  2297
Garlic               2058
Eggs                 1924
Soy sauce            1883
Dark soy sauce       1625
Pepper powder        1619
Sichuan peppercorns  1571
Carrots              1324
......
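A count like the one above can be sketched with the standard library. This is only an illustration: the toy `recipes` list is made up, and counting each ingredient at most once per recipe is an assumption, since the article does not say how repeated names within a recipe were handled.

```python
from collections import Counter

# Hypothetical toy data: each recipe is a list of ingredient names.
recipes = [
    ["Potato", "Salt", "Oil"],
    ["Pork belly", "Salt", "Cooking wine", "Ginger"],
    ["Chicken", "Salt", "Cooking wine"],
]

# Count in how many recipes each ingredient appears.
# set(recipe) makes sure an ingredient is counted once per recipe (an assumption).
ingredient_counts = Counter(name for recipe in recipes for name in set(recipe))

# Print the most frequent ingredients first.
for name, count in ingredient_counts.most_common(2):
    print(name, count)
```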
By treating each recipe as a tokenized sentence, we can use this "corpus" to train a Word2Vec model and see what interesting results we get. (It doesn't matter if we don't find anything; it's all about exploration.) The entire training process was surprisingly fast, taking less than a second.
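For readers who want to reproduce this step, here is a minimal sketch, assuming the gensim library (version 4.0 or later) is installed. The helper names, the toy recipes, and every hyperparameter value are assumptions for illustration; the article does not state the settings that were actually used.

```python
def to_sentences(recipes):
    """Treat each recipe as one 'sentence' whose tokens are ingredient names."""
    return [[name.strip() for name in recipe if name.strip()] for recipe in recipes]


def train_recipe_vectors(sentences, vector_size=100, window=5, min_count=5):
    """Train a Word2Vec model on ingredient lists.

    Requires gensim >= 4.0; the import is done lazily so that the rest of
    this sketch can be read and run without it. All hyperparameters are
    placeholder values, not the article's settings.
    """
    from gensim.models import Word2Vec

    return Word2Vec(
        sentences=sentences,
        vector_size=vector_size,
        window=window,
        min_count=min_count,
        workers=1,
        seed=0,
    )


# Hypothetical toy data, including the kind of stray whitespace and empty
# entries that scraped data tends to contain.
toy_recipes = [
    ["Chicken", "Corn", "Carrot", "Ginger", "Salt"],
    ["Ribs", "Corn ", "Yam", "Ginger", "Salt"],
    ["Chicken", "Potato", "Light soy sauce", ""],
]
sentences = to_sentences(toy_recipes)

# With gensim installed, training and querying would look like this
# (min_count=1 only because the toy corpus is tiny):
# model = train_recipe_vectors(sentences, min_count=1)
# print(model.wv.most_similar("Chicken", topn=3))
```

Note that in gensim 4.0 and later the similarity query lives on `model.wv`; the older `model.most_similar` form that appears in the session output in this article comes from an earlier gensim release.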
For readers who are unfamiliar, here is a brief introduction: Word2Vec is a model that transforms words into real-valued vectors. Computers can only process words once they have been converted into numbers. The vectors produced by Word2Vec have some special properties; for example, the cosine similarity between two word vectors measures how similar the two words are.
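To make the cosine similarity concrete, here is a small sketch in plain Python. The three-dimensional vectors are invented for illustration only; real Word2Vec vectors typically have a hundred or more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors, not taken from any trained model.
toy_vectors = {
    "Chicken": [1.0, 2.0, 0.5],
    "Corn": [0.9, 2.1, 0.4],
    "Squid": [-1.0, 0.2, 2.0],
}

chicken_corn = cosine_similarity(toy_vectors["Chicken"], toy_vectors["Corn"])
chicken_squid = cosine_similarity(toy_vectors["Chicken"], toy_vectors["Squid"])
print(chicken_corn, chicken_squid)
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0.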
Once the Word2Vec model is trained, the first thing we can do is compare the similarity of two words. Some results are quite expected, such as:
>>> pd.Series(model.most_similar(u'Pork belly'))
0    (Ribs, 0.882662177086)
1    (Skin-on pork belly, 0.866969347)
2    (Dried cowpeas, 0.864805340767)
3    (Quail eggs, 0.850470840931)
4    (Pickled vegetables, 0.842567443848)
5    (Duck legs, 0.841659963131)
6    (Three-yellow chicken, 0.837065219879)
7    (Old brine soup, 0.828875720501)
8    (Chicken gizzards, 0.827436089516)
9    (Crucian carp, 0.826281666756)
However, some results are rather surprising, for example:
>>> pd.Series(model.most_similar(u'Chicken'))
0    (Corn, 0.939546108246)
1    (Agrocybe aegerita, 0.914446234703)
2    (Sweet corn, 0.888315618038)
3    (Fresh shrimp, 0.88096922636)
4    (Brown beech mushroom, 0.870144784451)
5    (Red carrot, 0.86743336916)
6    (Pasta, 0.864846467972)
7    (Kewpie salad dressing, 0.860477805138)
8    (Pork loin, 0.85995388031)
9    (White mushroom, 0.855247914791)
Here, "Chicken" and "Corn" are highly similar! This suggests a strong connection between them in the recipe data.
What is the reason behind this? Word2Vec learns from word co-occurrence, so there are two likely explanations:
1. Corn and chicken are often cooked together;
2. Corn and chicken are often cooked separately, but with similar ingredients.
In fact, a bit of observation reveals both are true, primarily because they are often used together to simmer soup, and their accompanying ingredients are similar:
Recipes containing chicken (partial):
140   [Chicken, Maca, Goji berries, Red dates, Longan, Lotus seeds, Ginger]
144   [Chicken, Erjingtao chili, Vegetable oil, Old ginger, Garlic, Dried chili, Sichuan peppercorn, Salt, Cooking wine, Light soy sauce, White sugar]
267   [Chicken, Potato, Green pepper, Onion, Flour, Ginger/Garlic, Small chili, Dark soy sauce, White sugar, Light soy sauce, Sichuan peppercorns/Anise...
313   [Chicken, Beer, Potato, Sichuan peppercorn, Star anise, Small chili, Ginger/Garlic, Light soy sauce, Dark soy sauce]
520   [Chicken, Breadcrumbs, Egg, Glutinous rice flour, Light soy sauce, Oyster sauce, Salt, Pepper powder]
961   [Cucumber, Chicken, Dried shrimp, Red dates, Ginger, Star anise, Scallion]
1005  [Glutinous rice, Chicken, Shrimp, Ginger slivers, Chopped scallion, Light soy sauce, Oyster sauce, Pepper powder, Starch]
1095  [Chicken, Egg white, Flour, Ginger, Garlic, Cooking wine, Salt, Black pepper]
1178  [Chicken, Milk, Salt, Pepper powder, Garlic powder, Low-gluten flour, Starch, Ice water, Peanut bits, Edible oil, Wheat...
1551  [Chicken, Button mushroom, Onion, Dried chili, Scallion, Ginger, Garlic, Sichuan peppercorn, Rock sugar, Salt]
Recipes containing corn (partial):
106   [Chicken wings, Ribs 2-3 pieces, Salt, Ginger, Celery, Cooking wine, Carrot, Corn, Red dates, Ginseng slices, Cordyceps...
172   [Yam, Corn, Dragon bone (pork bone), Salt, Ginger]
316   [Pork bone, Corn, Iron-rod yam, Red dates]
441   [Corn, Tomato, Tofu, Vegetable oil, Old ginger, Large scallion, Sichuan peppercorns, Salt, Beef powder]
450   [Yam, Ribs, Carrot, Ginger, Cooking wine, Thirteen-spice, Goji berries, Corn, Scallion segments, Star anise, Refined salt]
483   [Red carrot, Pumpkin, Celery, Corn, Broccoli, Pine nuts]
485   [Lotus root, Carrot, Shiitake mushroom, Peanuts, Red dates, Corn, Ginger slices]
509   [Ribs, Corn, King oyster mushroom, Goji berries, Bay leaf, MSG, Salt]
789   [Chicken wing, Enoki mushroom, Shiitake mushroom, White mushroom, Chicken thigh mushroom, Pholiota nameko, Dried scallop, Cordyceps flower, Corn, Salt, Ginger,...
828   [Meat cubes, Corn, Carrot, Fish sauce, Light soy sauce, Salt, Peanut oil]
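The two explanations can be checked directly on the recipe lists, without Word2Vec. The sketch below is one possible way to do it, not the procedure used for this article: the function names and the toy recipes are hypothetical, and the Jaccard overlap of companion ingredients is just one simple choice for measuring how similar the accompanying ingredients are.

```python
def cooccurrence_count(recipes, a, b):
    """Explanation 1: in how many recipes do a and b appear together?"""
    return sum(1 for recipe in recipes if a in recipe and b in recipe)


def companions(recipes, target, exclude=()):
    """All ingredients that appear in at least one recipe alongside target."""
    found = set()
    for recipe in recipes:
        if target in recipe:
            found.update(recipe)
    return found - {target} - set(exclude)


def companion_overlap(recipes, a, b):
    """Explanation 2: Jaccard overlap of the companion sets of a and b."""
    companions_a = companions(recipes, a, exclude=[b])
    companions_b = companions(recipes, b, exclude=[a])
    union = companions_a | companions_b
    return len(companions_a & companions_b) / len(union) if union else 0.0


# Hypothetical toy data.
toy_recipes = [
    ["Chicken", "Corn", "Carrot", "Ginger"],
    ["Chicken", "Red dates", "Goji berries", "Ginger"],
    ["Ribs", "Corn", "Red dates", "Ginger"],
    ["Squid", "Onion", "Bell pepper"],
]

print(cooccurrence_count(toy_recipes, "Chicken", "Corn"))
print(companion_overlap(toy_recipes, "Chicken", "Corn"))
print(companion_overlap(toy_recipes, "Chicken", "Squid"))
```

A high co-occurrence count supports the first explanation, and a high companion overlap supports the second; both can be true at once, as they are for chicken and corn.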
It seems our little experiment has indeed produced some interesting results. From experience alone, we might not easily notice the strong correlation between "chicken" and "corn," but through data mining, as long as there is enough data, various interesting findings can be uncovered.
Similar results include: Beef and squid have a similarity of 96%, beef and potato have a similarity of 91%, and so on!
Looking at the data below, you can try to explain the result for beef and squid:
Recipes containing beef (partial):
46    [Beef, Carrot, Scallion, Curry powder, Salt, Coconut milk, Potato, Onion, Ginger, Korean soy sauce, Thai fish sauce]
70    [Beef, Wood ear mushroom, Carrot, Red bell pepper, Green pepper, Chopped scallion, Ginger slivers, Minced garlic, Peanut oil, Pixian bean paste,...
148   [Beef, Green pepper, Onion, Ginger/Garlic, Sichuan peppercorn, Bay leaf, Star anise, Dark soy sauce, Light soy sauce, Cumin powder, Pepper powder]
272   [Beef, Taro, Star anise, Bay leaf, Cinnamon, Sichuan peppercorn, Ginger, Chili, Rock sugar, Scallion, Salt, Cooking wine, Light soy sauce]
290   [Beef, Carrot, Onion, Red wine, Broth, Salt, Pepper powder, Tomato paste, Butter, Flour, Bay leaf...
404   [Beef, Bay leaf, Star anise, Sichuan peppercorn, Ginger/Garlic, Oyster sauce, Light soy sauce, Salt, Chicken essence]
433   [Bean sprouts, Beef, Scallion, Green pepper, Oyster sauce, Salt]
452   [Beef, White sugar, Sichuan peppercorn, Paste, Salt, Carrot, Soy sauce, Star anise, Garlic, Cooking wine]
455   [Beef, Dried yellow soybean paste, Thirteen-spice, Cinnamon powder, Star anise powder, Sichuan peppercorn powder, Ginger powder, Salt, Old broth]
534   [Potato, Beef, Salt, Cooking wine, Light soy sauce, Garlic, Ginger, Coriander]
Recipes containing squid (partial):
187   [Squid, Spare rib sauce, Sugar, Oyster sauce, Onion, Chili, Bell pepper]
284   [Squid, Bell pepper]
374   [Squid, Onion, White sesame, Garlic chili sauce, Scallion, Ginger, Fragrant scallion, BBQ sauce]
996   [Onion, Luffa, Squid, Oil, Salt, Soy sauce, White sugar, Cooking wine, Oyster sauce]
1468  [Squid, Round onion, Lettuce, First-grade soy sauce, Garlic chili sauce, Sugar, Oyster sauce, Cooking wine, Salt, Chicken essence]
1502  [Squid, Green/Red chili, Ginger/Garlic, Salt, White sugar, Light soy sauce, Starch, Sichuan peppercorns]
1577  [Squid, Color pepper, Yellow bean paste, Ginger]
1619  [Shrimp, White clams, Squid, Straw mushroom, Tom Yum paste, Fresh lemon leaves, Fish sauce, Coconut milk, Sugar]
1796  [Razor clams, Squid, Chives, White pepper powder, Cooking wine, Salt, Ginger]
1798  [Squid, Cumin, Salt, Peanut oil]
Another potentially meaningful attempt is mining association rules. Since the data volume isn't huge, I simply use the Apriori algorithm.
Before mining the rules, some preprocessing is done:
1. Remove salt, because there are too many instances of salt. If not removed, many of the mined rules would just contain salt, which would basically tell us "remember to put salt when cooking," a rule of little significance.
2. Remove ingredients that appear only once; these contain too little information and likely won't appear in the rules anyway, and keeping them increases computational load.
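The two preprocessing steps above could look roughly like this. The function name, its parameters, and the toy recipes are assumptions made for this sketch.

```python
from collections import Counter


def preprocess(recipes, stop_items=("Salt",), min_count=2):
    """Drop overly common items (salt) and items seen in fewer than min_count recipes."""
    # Count in how many recipes each ingredient appears.
    counts = Counter(name for recipe in recipes for name in set(recipe))
    keep = {
        name
        for name, count in counts.items()
        if count >= min_count and name not in stop_items
    }
    # Keep the original order of ingredients within each recipe.
    return [[name for name in recipe if name in keep] for recipe in recipes]


# Hypothetical toy data: "Maca" and "Garlic" appear only once.
toy_recipes = [
    ["Salt", "Ginger", "Scallion", "Maca"],
    ["Salt", "Ginger", "Scallion"],
    ["Salt", "Garlic", "Ginger"],
]
cleaned = preprocess(toy_recipes)
print(cleaned)
```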
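After this processing, if we set the minimum support (the fraction of all recipes in which every ingredient of the rule appears together) to 0.01 and the minimum confidence (among recipes containing the left-hand side of the rule, the fraction that also contain the right-hand side) to 0.8, we get the following rules: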
Rule                                                   Support   Confidence
Cooking wine--Scallion--Garlic -> Ginger               0.019935  0.912060
Sugar--Scallion--Garlic -> Ginger                      0.011203  0.879310
Cooking wine--Sichuan peppercorns--Scallion -> Ginger  0.010544  0.872727
Cooking wine--Dark soy sauce--Scallion -> Ginger       0.011643  0.868852
Star anise--Scallion -> Ginger                         0.016695  0.858757
Cooking wine--Sugar--Scallion -> Ginger                0.013345  0.846690
Cooking wine--Light soy sauce--Garlic -> Ginger        0.013345  0.840830
Light soy sauce--Scallion--Garlic -> Ginger            0.012741  0.840580
Dark soy sauce--Garlic -> Ginger                       0.013290  0.831615
Sichuan peppercorns--Scallion -> Ginger                0.019221  0.825472
Cooking wine--Garlic -> Ginger                         0.032511  0.821082
Cooking wine--Light soy sauce--Scallion -> Ginger      0.019057  0.816471
Dark soy sauce--Scallion -> Ginger                     0.017903  0.808933
Cooking wine--Scallion -> Ginger                       0.050799  0.805749
These mean: if cooking wine, scallion, and garlic appear, then ginger should also be added; if sugar, scallion, and garlic appear, remember to add ginger; and so on. These rules all end with ginger, telling us when ginger needs to be used. In cooking, these rules are quite meaningful (especially for beginners). These rules also indicate that in Chinese cuisine, ginger is a very important ingredient.
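For readers curious how such rules are found, here is a compact sketch of a level-wise, Apriori-style search in plain Python. It is an illustrative reimplementation under stated assumptions, not the code used for this article: all function names and the toy transactions are hypothetical, and only rules with a single ingredient on the right-hand side are generated.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: grow itemsets one item at a time, pruning early."""
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if support(transactions, [i]) >= min_support}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(transactions, itemset)
        k += 1
        # Candidates of size k are unions of two frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(transactions, c) >= min_support}
    return frequent


def mine_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence), best confidence first."""
    frequent = frequent_itemsets(transactions, min_support)
    found = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            confidence = sup / frequent[antecedent]
            if confidence >= min_confidence:
                found.append((sorted(antecedent), consequent, sup, confidence))
    return sorted(found, key=lambda rule: -rule[3])


# Hypothetical toy transactions (one set of ingredients per recipe).
toy = [
    {"Cooking wine", "Scallion", "Ginger"},
    {"Cooking wine", "Scallion", "Ginger", "Garlic"},
    {"Cooking wine", "Scallion"},
    {"Cinnamon", "Star anise"},
    {"Ginger", "Garlic"},
]
mined = mine_rules(toy, min_support=0.4, min_confidence=0.8)
for antecedent, consequent, sup, confidence in mined:
    print("--".join(antecedent), "->", consequent, sup, confidence)
```

The pruning step is what makes the search tractable: an itemset can only be frequent if all of its subsets are, so most combinations never need to be counted.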
We can relax the conditions slightly to uncover more rules. Reducing the confidence requirement to 0.7 yields:
Rule                                           Support   Confidence
Scallion--Garlic -> Ginger                     0.038497  0.799316
Cinnamon--Bay leaf -> Star anise               0.018013  0.782816
Rock sugar--Cinnamon -> Star anise             0.010160  0.780591
Sichuan peppercorns--Garlic -> Ginger          0.014169  0.779456
Light soy sauce--Bay leaf -> Star anise        0.010874  0.770428
Cooking wine--Cinnamon -> Star anise           0.014938  0.764045
Cinnamon -> Star anise                         0.031633  0.761905
Cinnamon--Sichuan peppercorns -> Star anise    0.015267  0.761644
Dark soy sauce--Bay leaf -> Star anise         0.011423  0.759124
Pepper powder--Scallion -> Ginger              0.014498  0.758621
Cinnamon--Dark soy sauce -> Star anise         0.013070  0.757962
Sugar--Scallion -> Ginger                      0.022297  0.753247
Cinnamon--Light soy sauce -> Star anise        0.011148  0.751852
Cooking wine--Bay leaf -> Star anise           0.012411  0.750831
Sichuan peppercorns--Bay leaf -> Star anise    0.014608  0.745098
Scallion--Vinegar -> Ginger                    0.011203  0.744526
Rock sugar--Light soy sauce -> Dark soy sauce  0.010929  0.742537
Bay leaf -> Star anise                         0.028777  0.738028
Ginger--Cinnamon -> Star anise                 0.011917  0.735593
Starch--Scallion -> Ginger                     0.011038  0.730909
Ginger--Bay leaf -> Star anise                 0.010215  0.723735
Ginger--Sugar--Garlic -> Scallion              0.011203  0.720848
Sugar--Garlic -> Ginger                        0.015542  0.712846
Light soy sauce--Scallion -> Ginger            0.032402  0.704898
Scallion--Soy sauce -> Ginger                  0.017244  0.700893
If the rules about ginger were too commonplace, the rules obtained now should be more meaningful. For example, "Cinnamon--Bay leaf -> Star anise," "Rock sugar--Cinnamon -> Star anise," and "Cinnamon--Light soy sauce -> Star anise" are likely recipes related to braised flavors (Luwei). These combinations might not be known to every amateur cook, but they can be mined through association rules.
There are also some rules with higher confidence but slightly lower support:
Rule                                                           Support   Confidence
Cooking wine--Dark soy sauce--Scallion--Garlic -> Ginger       0.005602  0.953271
Cooking wine--Sichuan peppercorns--Scallion--Garlic -> Ginger  0.005547  0.952830
Cooking wine--Sugar--Scallion--Garlic -> Ginger                0.007634  0.952055
Cooking wine--Cinnamon--Scallion -> Ginger                     0.005711  0.936937
Cooking wine--Light soy sauce--Scallion--Garlic -> Ginger      0.007743  0.921569
Cinnamon--Sichuan peppercorns--Scallion -> Ginger              0.005217  0.913462
Cooking wine--White sugar--Garlic -> Ginger                    0.005931  0.805970
Cooking wine--Scallion -> Ginger                               0.050799  0.805749
Rock sugar--Dark soy sauce--Bay leaf -> Star anise             0.005162  0.803419
These are more detailed and precise seasoning formulas. Note that these are results mined automatically by a computer; our "chef" is the computer.
This article attempted to combine two of my interests, data mining and the kitchen, and produced some results that look quite interesting. They do match some of my own understanding of cooking, but whether they are truly interesting is for the reader to judge.
This type of mining is essentially text mining, which falls within the field of natural language processing (NLP). As seen in this article, the methods used are basic NLP methods. Knowledgeable readers will know that the difficulty in NLP lies in feature construction—how to represent a word with numbers. This article made an attempt, but because the data volume was insufficient and for other reasons, the conclusions may not be strictly accurate. This attempt wasn't exceptionally successful, but the process is worth noting. Perhaps more data could increase the value of such research results.
Intuitively, this kind of mining is meaningful. We can extract information we didn't know from an ordinary field; perhaps we knew this information but never paid close attention to it. The computer helps us discover it, allowing us to face it or utilize it better. Data mining can help us live better lives. Indeed, data mining technology should be popularized and democratized because our lives are our most important source of data.
Lastly, sharing the scraped data: Recipe Data.zip
Originally published at https://kexue.fm/archives/3587