By 苏剑林 | May 17, 2017
Recently I needed to crawl some children's story corpora to train word vectors, so I found several fairy tale websites and crawled all of their articles. Below I share the process, implemented in Python, together with my earlier experience crawling Baidu Baike. This tutorial suits the following scenario: you need to crawl a specified website by traversing it, and that website has no anti-crawler measures. Under this premise, the only challenges are the traversal algorithm and programming technique.
To reiterate our assumptions:
1. We need to traverse the entire website to crawl the information we need;
2. The website has no anti-crawler measures;
3. Every page of the website can eventually be reached from the homepage by progressively clicking hyperlinks.
What kind of websites fit these assumptions? The answer is: quite a few. For instance, the story website we are about to crawl, as well as Baidu Baike, Hudong Baike, etc.
First, let's look at how to crawl this story website:
(For educational use only, no malicious intent~)
The Breadth-First Search (BFS) algorithm is actually very simple: for every page crawled, collect all the in-site hyperlinks on that page, and push any link that has not yet been enqueued onto the queue.
Finished? Finished! It’s that simple. Note that the queue follows the "First-In-First-Out" (FIFO) principle; therefore, what was described above is essentially the BFS algorithm. Writing it in Python is also very straightforward:
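What follows is a minimal sketch of such a framework rather than the exact original code: the site root BASE, the link-matching pattern, and the page-handling step are placeholder assumptions that you would adapt to your target site.

```python
import re
import requests
from queue import Queue
from threading import Thread

BASE = 'http://www.example-story-site.com'  # placeholder root, not the real site

tasks = Queue()   # FIFO queue, which is what makes the traversal breadth-first
seen = set()      # links already enqueued, for deduplication
results = {}      # path -> page html, kept in memory for now

def enqueue(link):
    # check-then-add has a benign race across threads; a Lock would make it
    # airtight, but this is good enough for a sketch
    if link not in seen:
        seen.add(link)
        tasks.put(link)

def crawl():
    while True:
        path = tasks.get()
        try:
            html = requests.get(BASE + path, timeout=10).text
        except requests.RequestException:
            tasks.task_done()
            continue
        # collect in-site hyperlinks; this pattern is a simplistic placeholder
        for link in re.findall(r'href="(/[^"]+)"', html):
            enqueue(link)
        results[path] = html  # in practice, parse out the story text instead
        tasks.task_done()

enqueue('/')  # start from the homepage

# a few worker threads give the framework simple concurrency
for _ in range(4):
    Thread(target=crawl, daemon=True).start()

tasks.join()  # returns once every discovered page has been processed
print('%d pages crawled' % len(results))
```

Because the queue is FIFO, pages are visited level by level outward from the homepage; swap in a last-in-first-out stack and the very same loop becomes depth-first search.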
In just a few lines, we have a general-purpose crawling framework with multi-threaded concurrency. Most of the code follows a fixed pattern and is highly reusable. Such is the elegance of Python; as they say, "Life is short, I use Python."
The code above already constitutes a general crawling framework. However, crawling Baidu Baike or Hudong Baike raises new problems: the code performs all data I/O in memory, which is fine for small sites, but for Baike-scale sites with millions or even tens of millions of pages it clearly cannot hold up. Two issues therefore need attention: 1. resumable crawling (continuing from a breakpoint); 2. memory efficiency. In fact, both are solved by the same thing: a database. Previously the queue lived in a Queue and the results in a dictionary, both in memory; move them into a database and both problems disappear.
As for databases, I personally prefer MongoDB. Other advantages aside, it feels very Pythonic: operated through pymongo, you hardly feel like you are using a database at all; it feels like pure Python (by contrast, driving SQL from Python usually means writing SQL statements anyway). I won't cover installing MongoDB here; assuming it and pymongo are already installed, here is reference code for crawling Baidu Baike:
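The sketch below follows the same BFS idea but moves both the task frontier and the results into MongoDB. It is single-threaded for clarity, it assumes a local mongod, and the database name, collection layout, link pattern, and title regex are my assumptions rather than a canonical implementation.

```python
import re
import datetime
import urllib.parse

import requests
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

db = MongoClient().baike                  # database name "baike" is assumed
db.urls.create_index('url', unique=True)  # frontier + dedup in one collection

def add_url(url):
    try:
        db.urls.insert_one({'url': url, 'done': False})
    except DuplicateKeyError:
        pass  # already queued or already crawled

add_url('http://baike.baidu.com/item/数学')  # an arbitrary seed entry

n = 0
while True:
    task = db.urls.find_one({'done': False})  # oldest undone first: FIFO-ish
    if task is None:
        break  # frontier exhausted
    try:
        html = requests.get(task['url'], timeout=10).text
    except requests.RequestException:
        db.urls.update_one({'_id': task['_id']}, {'$set': {'done': True}})
        continue  # give up on this URL and move on
    m = re.search(r'<title>(.*?)_百度百科</title>', html)
    title = m.group(1) if m else ''
    # enqueue every in-site entry link found on this page
    for link in re.findall(r'href="(/item/[^"#?]+)"', html):
        add_url('http://baike.baidu.com' + urllib.parse.unquote(link))
    db.texts.insert_one({'url': task['url'], 'title': title, 'html': html})
    db.urls.update_one({'_id': task['_id']}, {'$set': {'done': True}})
    n += 1
    print('%s, Crawling "%s", URL: %s, %d crawled'
          % (datetime.datetime.now(), title, task['url'], n))
```

Because all state lives in the database, killing the script and restarting it resumes exactly where it left off, and memory usage stays flat no matter how large the crawl grows.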
Example Output:
2017-05-17 20:20:18.428393, Crawling "Physical Anthropology", URL: http://baike.baidu.com/item/体质人类学, 167 crawled
2017-05-17 20:20:18.502221, Crawling "Group Dynamics", URL: http://baike.baidu.com/item/群体动力学, 168 crawled
2017-05-17 20:20:18.535227, Crawling "Biological Taxonomy", URL: http://baike.baidu.com/item/生物分类学, 169 crawled
2017-05-17 20:20:18.545897, Crawling "Virology", URL: http://baike.baidu.com/item/病毒学, 170 crawled
2017-05-17 20:20:18.898083, Crawling "Chromatography (Book Title)", URL: http://baike.baidu.com/item/色谱法, 171 crawled
2017-05-17 20:20:18.929467, Crawling "Molecular Biology (Natural Science Branch)", URL: http://baike.baidu.com/item/分子生物, 172 crawled
2017-05-17 20:20:18.974105, Crawling "Geochemistry (Subject Name)", URL: http://baike.baidu.com/item/地球化学, 173 crawled
2017-05-17 20:20:18.979666, Crawling "Nanotechnology (Physics Term)", URL: http://baike.baidu.com/item/纳米科技, 174 crawled
2017-05-17 20:20:19.077445, Crawling "Theoretical Chemistry", URL: http://baike.baidu.com/item/理论化学, 175 crawled
2017-05-17 20:20:19.143304, Crawling "Thermochemistry", URL: http://baike.baidu.com/item/热化学, 176 crawled
2017-05-17 20:20:19.333775, Crawling "Acoustics", URL: http://baike.baidu.com/item/声学, 177 crawled
2017-05-17 20:20:19.349983, Crawling "Mathematical Physics", URL: http://baike.baidu.com/item/数学物理, 178 crawled
2017-05-17 20:20:19.662366, Crawling "High Energy Physics", URL: http://baike.baidu.com/item/高能物理学, 179 crawled
2017-05-17 20:20:19.797841, Crawling "Physics (Natural Science Discipline)", URL: http://baike.baidu.com/item/物理学, 180 crawled
2017-05-17 20:20:19.809453, Crawling "Condensed Matter Physics", URL: http://baike.baidu.com/item/凝聚态物理学, 181 crawled
2017-05-17 20:20:19.898944, Crawling "Atom (Physics Concept)", URL: http://baike.baidu.com/item/原子, 182 crawled
This completes the basic framework. Of course, readers will still need to adapt it to their own needs: for example, strengthening the regular expressions (a sketch follows below) to strip page noise such as "收藏 查看我的收藏 0 有用+1 已投票" ("Collect / View My Collection / 0 Useful +1 / Voted"), and extracting structured information for separate storage (don't be lazy and use the raw output directly; it will contain plenty of noise~ there's no such thing as a free lunch). All in all, feel free to build on this however you like. Run long enough, the code above can crawl roughly 2.8 million entries, which basically covers most commonly used terms.
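For illustration, a hypothetical cleanup pass might look like this; the patterns below are guesses at the page noise and will need tuning against real pages.

```python
import re

def clean(html):
    # drop scripts and styles first, then every remaining tag
    text = re.sub(r'<script[\s\S]*?</script>|<style[\s\S]*?</style>', '', html)
    text = re.sub(r'<[^>]+>', ' ', text)
    # strip site boilerplate such as "收藏 查看我的收藏 0 有用+1 已投票"
    text = re.sub(r'收藏\s*查看我的收藏\s*\d*\s*有用\+1\s*已投票\s*\d*', '', text)
    return re.sub(r'\s+', ' ', text).strip()  # normalize whitespace
```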
Some readers might ask: Doesn't Baidu Baike have links like http://baike.baidu.com/view/52650.htm? Couldn't you traverse the entire site just by iterating through those numbers?
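Concretely, that idea amounts to nothing more than the following (a hypothetical sketch; the ID range is a guess, and many IDs will simply return error pages):

```python
import requests

for i in range(1, 1000):  # in reality the IDs run into the millions
    url = 'http://baike.baidu.com/view/%d.htm' % i
    r = requests.get(url, timeout=10)
    if r.status_code == 200:
        pass  # parse r.text as before
```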
If you're only considering Baidu Baike, that logic is correct. However, that approach is not universal; for example, Hudong Baike does not have such links. Since we are interested here in a universal crawling solution, we still adopt this BFS traversal approach.
To reiterate: the websites and demonstration code involved in this article are for teaching and demonstration purposes only, without any malicious intent~