By 苏剑林 | June 06, 2017
This is a simplified version of the paper I submitted for Question C of this year's Teddy Cup. Although it ended up with only a consolation prize, I feel that some of the ideas in it are still quite valuable for web crawling work, so I am sharing them here for everyone's reference.
Introduction
A crawler can be divided into two steps: 1. downloading the web page; 2. extracting the required information from the web page. Both steps have their own technical difficulties. For the first step, the difficulty lies in handling the anti-crawling measures and quirks of each website, such as IP blocking or CAPTCHAs when the access frequency is too high; this requires designing site-specific anti-crawling strategies, and, in principle, no universal solution exists. For the second step, the traditional approach is to design corresponding regular expressions, but as website designs become increasingly diverse, writing regular expressions has become correspondingly difficult.
Clearly, trying to obtain a universal crawler solution using traditional regular expression methods is quite difficult. But if we jump out of the limitations of regular expression thinking and look at a website from a global perspective, combining it with DOM tree parsing, we can obtain a fairly universal solution. Therefore, the main content of this article revolves around the second step of web crawling. The work in this article is divided into two parts: first, a scheme for information extraction suitable for general websites is proposed, and then, this scheme is refined and applied specifically to information extraction from forums.
Process
It sounds sophisticated, but the core idea can actually be summed up in one sentence, "do subtraction": Webpage Content - Webpage Template = Effective Content. Where does the webpage template come from? It is the intersection of all pages on the same website!
The specific process is roughly as follows:
1. Download the pages of the target website and parse them into a DOM tree;
2. Compare the DOM trees of several pages to obtain the website's basic template;
3. Traverse the DOM tree in a "depth-first" manner and number every node and leaf of the tree;
4. Compare the DOM tree of each page with the standard template; the differing parts are the effective text;
5. Further filter through some manual rules;
6. Using the node and leaf numbers as values, cluster the effective text and divide it into blocks.
Analysis
The following is a brief explanation of the above process.
The source code of a normal web page is a mixture of HTML tags and effective text, and different websites use different templates, which makes general-purpose crawling difficult. However, different pages of the same website are consistent with one another. In plain terms: on the same website, different pages look more or less the same; the only thing that differs is the main content of each page.
Therefore, for a general crawler, our design idea is: utilize different pages of the same website to construct a standard template for that website, and compare any page of the same website with the standard template. The differences are the effective text we need to extract. In this process, we use the DOM tree to perform syntax parsing of the web page source code.
HTML Code
A normal and relatively standard HTML page source code looks something like this:
<html>
<head>
<title>This is the Website Title</title>
</head>
<body>
<div id="content">
<h2 class="title">This is Article Title 1</h2>
<p>This is the text of the first article.</p>
<h2 class="title">This is Article Title 2</h2>
<p>This is the text of the second article.</p>
</div>
</body>
</html>
DOM Tree
We can parse the source code of an HTML page into a tree structure called a DOM tree (Document Object Model), which uses an object representation to describe the corresponding document's structure and content. For the HTML page shown previously, it can be represented as the following DOM tree structure:

[Figure: the DOM tree of the HTML page above]
In other words, a webpage is actually layers upon layers of nested HTML tags. Each tag is a node or a leaf in the tree, and the text content of the webpage is inside these tags. One benefit of parsing it into a DOM tree is that we can ignore the interference of specific styles. For example:
<div>
<h2 id="title_1" class="main_title">Title 1</h2>
<h2 id="title_2" class="sub_title">Title 2</h2>
</div>
Here, the div and h2 are fixed tags, but in order to give different titles different styles, each <h2> tag carries different id and class attributes. Since id and class can be set freely (or generated automatically by some rule), they create difficulties for general crawling. However, once parsed into a DOM tree, these page-specific attributes are ignored, and only the universal div and h2 tags are retained.
Another benefit of parsing into a DOM tree is that it describes the hierarchical structure of a webpage, allowing us to traverse an HTML page in an orderly manner and to partition the crawled content according to the page's hierarchy levels.
Standard Template
For the same website, we first download several pages and parse them into DOM trees. Then, we represent each node or leaf in the form of "tag path + tag text". For the HTML page shown earlier, its website title can be represented as "html_head_title_This is the Website Title", and its article title can be represented as "html_body_div_h2_This is Article Title 1". Note that not every tag has text content; tags like <img> generally do not have text, and here we only consider tags with text.
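To make the "tag path + tag text" representation concrete, here is a minimal sketch in Python, assuming the lxml library is available; the helper name tag_path_items is ours, and for simplicity the sketch omits the depth-first numbering used later for clustering:

from lxml import html

def tag_path_items(page_source):
    # Parse a page into a DOM tree and collect "tag path + tag text" strings;
    # attribute values such as id and class are ignored automatically.
    tree = html.fromstring(page_source)
    items = set()
    for node in tree.iter():
        if not isinstance(node.tag, str):   # skip comments and processing instructions
            continue
        text = (node.text or "").strip()
        if not text:                        # only tags that carry text are considered
            continue
        ancestors = [a.tag for a in node.iterancestors()][::-1]   # root ... parent
        items.add("_".join(ancestors + [node.tag]) + "_" + text)
    return items

For the example page above, this yields strings such as "html_head_title_This is the Website Title" and "html_body_div_h2_This is Article Title 1".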
At this point, every webpage is represented as a collection of tags. We can define the website's standard template as:
\begin{equation}S = \bigcap_{h\in H} h\end{equation}
where $H$ is the set of all webpages of that website. In other words, the standard template is the intersection of all web pages. At this point, extracting the effective text of a webpage is very simple:
\begin{equation}\text{effective} = h\backslash S\end{equation}
That is to say, the effective text of page $h$ is the set difference between set $h$ and set $S$.
Through this scheme, we can automatically build a standard template for any designated website, thereby extracting the effective text of the page. This process is valid based on the assumption that "content that is identical across web pages is meaningless, while content that differs is what is worth crawling." In this way, regardless of the website's hierarchical structure or whether there are embedded advertisements, we can effectively filter out the invalid text. Thus, we have obtained a universal crawling scheme.
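With each page represented as such a set of strings, the standard template and the effective text reduce to plain set operations. A minimal sketch, reusing the hypothetical tag_path_items helper from the previous sketch:

def build_template(page_sources):
    # S = intersection of the tag-path sets of all downloaded pages
    sets = [tag_path_items(p) for p in page_sources]
    return set.intersection(*sets)

def extract_effective(page_source, template):
    # effective text of page h = h \ S
    return tag_path_items(page_source) - template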
Streaming Generation
Although we define the standard template as the intersection of all pages, we do not need to crawl all pages before obtaining the standard template. At the beginning, we can use only two pages and take their intersection as the standard template. Then, every time a new page is crawled, we can update the standard template, making it increasingly accurate. This stream-based scheme meets the requirements of a production environment.
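A possible streaming version under the same assumptions: start from two pages, then let every newly crawled page both yield its effective text and refine the template.

class StreamingTemplate:
    def __init__(self, page_a, page_b):
        # initial template: intersection of the first two pages
        self.template = tag_path_items(page_a) & tag_path_items(page_b)

    def process(self, page_source):
        # diff the new page against the current template, then refine the template
        items = tag_path_items(page_source)
        effective = items - self.template
        self.template &= items
        return effective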
Rule Filtering
On some websites, different pages embed different snippets of code (for example, scripts that vary from page to page); crawling with the previous scheme will therefore keep these code fragments as well, reducing the precision of the effective text. Some filtering rules are needed to remove them.
Engineering Rules
From a production perspective, people are usually willing to sacrifice some precision in exchange for higher speed. The rule that is easiest to implement in engineering terms is therefore to count the proportion of Chinese characters in each segment of effective text obtained earlier. Then, based on the assumption that "webpage content is basically Chinese", we can confidently discard the parts with a low proportion of Chinese characters.
Of course, this will also remove information such as dates and usernames, so we should first identify dates and usernames to prevent them from being deleted by accident. Dates follow relatively fixed patterns, such as "2017-04-15" or "April 15, 2017", and are usually accompanied by words like "Release Time" or "Published on", so they are fairly easy to identify. Usernames are typically a combination of Chinese characters, English letters, digits, and underscores, with no spaces allowed, whereas code usually contains many spaces; this difference lets us identify usernames in advance as well.
So, the overall rules are as follows (a code sketch follows the list):
1. Text with date formats needs to be preserved;
2. Text composed purely of a combination of Chinese, English, numbers, and underscores needs to be preserved;
3. In the remaining parts, if the proportion of Chinese is greater than a certain threshold, it needs to be preserved;
4. All other content is filtered out directly.
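A rough sketch of these rules in Python; the date and username patterns and the 0.5 threshold are illustrative assumptions rather than values fixed by the scheme:

import re

DATE_RE = re.compile(r"\d{4}\s*[-/年]\s*\d{1,2}\s*[-/月]\s*\d{1,2}\s*日?")   # e.g. 2017-04-15, 2017年4月15日
NAME_RE = re.compile(r"^[\u4e00-\u9fa5A-Za-z0-9_]+$")   # Chinese/English/digits/underscore, no spaces
HAN_RE = re.compile(r"[\u4e00-\u9fa5]")                  # Chinese characters

def keep_segment(text, threshold=0.5):
    text = text.strip()
    if not text:
        return False
    if DATE_RE.search(text):       # rule 1: date-like text is preserved
        return True
    if NAME_RE.match(text):        # rule 2: username-like text is preserved
        return True
    han_ratio = len(HAN_RE.findall(text)) / len(text)
    return han_ratio > threshold   # rule 3: keep only if the Chinese proportion is high enough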
Academic Scheme
Engineering rules are simple and efficient but only applicable to specific situations. For example, the above rules are suitable for crawling general Chinese websites, but if it's an English website, these rules are not suitable.
In general, this is actually a problem of determining whether a segment of text is natural language. We can use language models. A language model models the probability of a sentence. A sentence consists of $n$ words $w_1, w_2, \dots, w_n$. We consider the probability:
\begin{equation}P(w_1,w_2,\dots,w_n)\end{equation}
According to the formulas of probability theory, we get:
\begin{equation}P(w_1,w_2,\dots,w_n)=P(w_1)P(w_2|w_1)P(w_3|w_1,w_2)\dots P(w_n|w_1,w_2,\dots,w_{n-1})\end{equation}
The right-hand side of the above equation is generally very difficult to estimate accurately. To this end, we make a truncation: each word is assumed to depend only on the word immediately preceding it. In this way, we get:
\begin{equation}\label{eq:score}P(w_1,w_2,\dots,w_n)=P(w_1)P(w_2|w_1)P(w_3|w_2)\dots P(w_n|w_{n-1})\end{equation}
In this way, we only need to count the probability of each word $P(w_i)$ and the co-occurrence probability of adjacent words $P(w_{i-1}, w_{i})$ to calculate the transition probability:
\begin{equation}P(w_i|w_{i-1}) = \frac{P(w_{i-1}, w_{i})}{P(w_{i-1})}\end{equation}
Once we have the transition probabilities, we can use formula \eqref{eq:score} to score each segment of text and set a threshold: only when the score exceeds this threshold do we conclude that the segment is natural language and keep it. This scheme applies to any language (it only requires a language model for the corresponding language) and is theoretically the more elegant approach.
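As an illustration, here is a toy bigram scorer, assuming the text has already been segmented into words and that the counts come from some background corpus; the smoothing constant and the length normalization of the score are practical additions, not part of the derivation above.

import math
from collections import Counter

def train_bigram(sentences):
    # sentences: an iterable of word lists from the background corpus
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def log_score(words, unigrams, bigrams, alpha=1e-6):
    # log P(w_1, ..., w_n) = log P(w_1) + sum_i log P(w_i | w_{i-1}), with tiny smoothing
    total = sum(unigrams.values())
    score = math.log((unigrams.get(words[0], 0) + alpha) / (total + alpha))
    for prev, cur in zip(words, words[1:]):
        p_trans = (bigrams.get((prev, cur), 0) + alpha) / (unigrams.get(prev, 0) + alpha)
        score += math.log(p_trans)
    return score / len(words)   # length-normalized so one threshold works for all lengths

Segments whose normalized score falls below the chosen threshold are then discarded as non-natural-language.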