Recording a Process of Crawling Taobao/Tmall Review Data

By 苏剑林 | May 06, 2015

Recently, I have become fascinated with data mining and machine learning. To do data analysis, one must first have data, and for ordinary folks like us, the cheapest way to get data is probably to scrape it from the web with a crawler. This article records the entire process of crawling the reviews of a product on Tmall. The approach for Taobao stores is similar, so I will not elaborate on it separately. The focus is on analyzing the page and implementing a simple, convenient crawl in Python.

The tools I used are as follows:

Python 3—An extremely convenient programming language. Version 3.x was chosen because it handles Chinese characters more gracefully.

Pandas—An additional Python library used for data organization.

IE 11—Used to analyze the page request process (any other similar traffic monitoring tool will also work).

The remaining libraries are requests and re; re ships with Python, and requests is easy to install (for example, via pip).

Example page (a Midea water heater): http://detail.tmall.com/item.htm?id=41464129793

Where are the reviews?

To crawl review data, you must first find where the reviews actually reside. Open the URL above and view the source code; you will find that there is no review content inside! So, where is the review data? It turns out Tmall loads the reviews dynamically via AJAX: the review data is fetched from a separate URL after the main page loads.

This is where IE 11 comes in handy (of course, you can use other traffic monitoring tools). Before using it, open the URL above. Once the page is loaded, clear the IE 11 cache and history files, then press F12. The following interface will appear:

[Screenshot: the IE 11 F12 developer tools panel]

At this point, click the green triangle button to start capturing network traffic (or simply press F5), and then click "Cumulative Reviews" on the Tmall page:

[Screenshot: starting the network capture]

The following results appear:

[Screenshot: the captured network traffic]

Many URLs appear under the URL column, and the review data is hidden among them! We mainly look for URLs with the type "text/html" or "application/json". After testing, it was found that Tmall's reviews are contained within the following URL:

http://rate.tmall.com/list_detail_rate.htm?itemId=41464129793&spuId=296980116&sellerId=1652490016&order=3&currentPage=1&append=0&content=1&tagId=&posi=&picture=&ua=166UW5TcyMNYQwiAiwVQX1EeUR5RH5Cd0xiNGI%3D%7CUm5Ockt1SHxBe0B0SXNOdCI%3D%7CU2xMHDJxPk82UjVOI1h2VngRd1snQSJEI107F2gFfgRlAmRKakQYeR9zFGoQPmg%2B%7CVGhXd1llXGJfa1ZsV2NeZFljVGlLdUt2TXFOc0tyT3pHe0Z6QHlXAQ%3D%3D%7CVWldfS0SMgo3FysUNBonHyMdNwI4HStHNkVrPWs%3D%7CVmhIGCIWNgsrFykQJAQ6DzQAIBwiGSICOAM2FioULxQ0DjEEUgQ%3D%7CV25OHjAePgA0DCwQKRYsDDgHPAdRBw%3D%3D%7CWGFBET8RMQ04ACAcJR0iAjYDNwtdCw%3D%3D%7CWWBAED5%2BKmIZcBZ6MUwxSmREfUl2VmpSbVR0SHVLcU4YTg%3D%3D%7CWmFBET9aIgwsECoKNxcrFysSL3kv%7CW2BAED5bIw0tESQEOBgkGCEfI3Uj%7CXGVFFTsVNQw2AiIeJxMoCDQIMwg9az0%3D%7CXWZGFjhdJQsrECgINhYqFiwRL3kv%7CXmdHFzkXNws3DS0RLxciAj4BPAY%2BaD4%3D%7CX2ZGFjgWNgo1ASEdIxsjAz8ANQE1YzU%3D%7CQHtbCyVAOBY2Aj4eIwM%2FAToONGI0%7CQXhYCCYIKBMqFzcLMwY%2FHyMdKRItey0%3D%7CQntbCyULKxQgGDgEPQg8HCAZIxoveS8%3D%7CQ3paCiQKKhYoFDQIMggwEC8SJh8idCI%3D%7CRH1dDSMNLRIrFTUJMw82FikWKxUueC4%3D%7CRX5eDiAOLhItEzMOLhIuFy4VKH4o%7CRn5eDiAOLn5GeEdnW2VeYjQUKQknCSkQKRIrFyN1Iw%3D%3D%7CR35Dfl5jQ3xcYFllRXtDeVlgQHxBYVV1QGBfZUV6QWFZeUZ%2FX2FBfl5hXX1AYEF9XXxDY0J8XGBbe0IU&isg=B2E8ACFC7C2F2CB185668041148A7DAA&_ksTS=1430908138129_1993&callback=jsonp1994

Does it look long enough to make you dizzy? Don't worry, with a little analysis, you'll find it can be boiled down to the following:

http://rate.tmall.com/list_detail_rate.htm?itemId=41464129793&sellerId=1652490016&currentPage=1

We find that Tmall is quite generous; the review page addresses are very regular (unlike JD, which is completely irregular and randomly generated). In this URL, itemId is the product ID, sellerId is the seller ID, and currentPage is the page number.
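For instance, here is a minimal helper for assembling the review URL of any product and page. This is a sketch of my own, assuming that only the three parameters above are needed:

# Build the review URL from a product ID, seller ID and page number.
# Assumes itemId, sellerId and currentPage are the only required parameters.
def rate_url(item_id, seller_id, page):
    return ('http://rate.tmall.com/list_detail_rate.htm'
            '?itemId={}&sellerId={}&currentPage={}').format(item_id, seller_id, page)

print(rate_url(41464129793, 1652490016, 1))  # reproduces the short URL above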

How to crawl?

After some effort, we finally found where the reviews are. Next is the crawling. How to crawl? First, analyze the page format.

[Screenshot: the raw review data returned by the URL above]

We see that the page data is very standardized. In fact, it is JSON, a lightweight data-interchange format (you can look up JSON if it is new to you). Strictly speaking, though, the response is not served as plain JSON: it is wrapped in a JSONP callback (note the callback=jsonp1994 parameter in the long URL above), and the part we want, the content inside the square brackets [], is itself valid JSON.

Now let's start our crawl. I use the requests library in Python. Enter the following in Python:

import requests as rq
url = 'http://rate.tmall.com/list_detail_rate.htm?itemId=41464129793&sellerId=1652490016&currentPage=1'
myweb = rq.get(url)

The content of the page is now saved in the myweb variable. You can view the text content using myweb.text.

The next step is to keep only the part inside the brackets. This requires regular expressions, using the re module.

import re
myjson = re.findall(r'"rateList":(\[.*?\]),"tags"', myweb.text)[0]

Uh, what does this line of code mean? Readers familiar with Python can probably understand it. If not, please read a tutorial on regular expressions. The code above means: find the following tags in the text:

"rateList":[...],"tags"

Once found, keep the square brackets and everything inside them. Why not just match a bare pair of brackets? Because a user's comment may itself contain brackets, which would cut the match short; anchoring on the surrounding keys avoids this.
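To see why the anchors matter, here is a toy demonstration; the fragment below is made up for illustration, not a real Tmall response:

import re

# A toy fragment: note the square brackets inside the user's comment.
toy = '"rateList":[{"content":"great [really!]"}],"tags":[]'

# Matching bare brackets stops too early, inside the comment:
print(re.findall(r'\[.*?\]', toy)[0])
# Anchoring on the surrounding keys captures the whole list:
print(re.findall(r'"rateList":(\[.*?\]),"tags"', toy)[0])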

Now we have myjson, which is standard JSON text. How do we read JSON? It's simple: use Pandas, a powerful data analysis library for Python that can read JSON directly. Of course, if reading JSON were all we needed, Pandas would be overkill; but we also need to merge the review data from every page into a single table and do some preprocessing, and Pandas makes that extremely convenient.

import pandas as pd
mytable = pd.read_json(myjson)

Now mytable is a standardized Pandas DataFrame:

[Screenshots: the resulting DataFrame, mytable]
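To get a feel for the data, you can inspect it with the usual Pandas methods. The exact column names depend on Tmall's response format at the time, so treat any specific field name (such as rateContent for the comment text) as an assumption rather than a guarantee:

print(mytable.shape)    # (number of reviews, number of fields)
print(mytable.columns)  # available fields; the comment text was in a column like rateContent
print(mytable.head())   # the first five reviews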

If you have two tables, mytable1 and mytable2, that need to be merged, you simply do:

pd.concat([mytable1, mytable2], ignore_index=True)

And so on. For more operations, please refer to Pandas tutorials.

Finally, save the reviews as a .txt or Excel file (due to Chinese encoding issues, saving as .txt might lead to errors, so saving as Excel is a good alternative; Pandas can also read/write Excel files):

mytable.to_csv('mytable.txt')
mytable.to_excel('mytable.xls')
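If the .txt route does fail with an encoding error, explicitly asking for UTF-8 usually helps, since to_csv accepts an encoding argument:

mytable.to_csv('mytable.txt', encoding='utf-8')  # force UTF-8 instead of the platform default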

A few conclusions

Let's count how many lines of code we used in total:

import requests as rq
import re
import pandas as pd
url = 'http://rate.tmall.com/list_detail_rate.htm?itemId=41464129793&sellerId=1652490016&currentPage=1'
myweb = rq.get(url)
myjson = re.findall(r'"rateList":(\[.*?\]),"tags"', myweb.text)[0]
mytable = pd.read_json(myjson)
mytable.to_csv('mytable.txt')
mytable.to_excel('mytable.xls')

Nine lines! In fewer than ten lines, we completed a simple crawler program and successfully scraped data from Tmall! Are you eager to try it out?

Of course, this is just a simple example. For practical use, some features should be added, such as finding out how many pages of reviews there are in total and reading them page by page. Additionally, batch acquisition of product IDs should be implemented. These are left to your own creativity; they are not difficult problems. This article merely hopes to serve as a starting point, providing a simple guide for readers who need to crawl data.
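As a starting point, here is a minimal sketch of a page-by-page crawl. The page count of 5 is an arbitrary choice of mine; a real crawler would read the true total from the response instead of hard-coding it:

import re
import requests as rq
import pandas as pd

url = ('http://rate.tmall.com/list_detail_rate.htm'
       '?itemId=41464129793&sellerId=1652490016&currentPage={}')
tables = []
for page in range(1, 6):  # pages 1 to 5; replace with the real page count
    text = rq.get(url.format(page)).text
    tables.append(pd.read_json(re.findall(r'"rateList":(\[.*?\]),"tags"', text)[0]))
mytable = pd.concat(tables, ignore_index=True)  # merge all pages into one table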

The hardest of these problems is probably that, after large-scale collection, Tmall's own anti-crawling system may detect you and demand a captcha before letting you continue. Handling that is much more complex. Possible workarounds include using proxies, leaving longer intervals between requests, or using an OCR system to recognize the captchas. I do not have a perfect solution for this either.
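Of those remedies, longer intervals are the easiest to try. Here is a hedged sketch, reusing the url template from the loop above; the delay range and the User-Agent string are arbitrary choices of mine, not values Tmall documents:

import time
import random
import requests as rq

headers = {'User-Agent': 'Mozilla/5.0'}  # a browser-like header; purely heuristic
for page in range(1, 6):
    myweb = rq.get(url.format(page), headers=headers)
    # ... extract and save the reviews as before ...
    time.sleep(random.uniform(2, 5))  # pause a few seconds between requests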