By 苏剑林 | January 09, 2018
Scientific Spaces is a blog built using the Typecho program. The sidebar provides a search function; however, Typecho's built-in search is merely a string-based exact match search. As a result, many reasonable queries fail to return results—for example, "2018天象" (2018 Astronomical Phenomena) or "新词算法" (New Word Algorithm) cannot produce results simply because those exact strings do not appear in the articles.
This led to the idea of strengthening the search function, an improvement that some readers had previously suggested. Over the past few days, I did some research. Initially, I planned to use the Whoosh library in Python to build a full-text search engine, but the workload for integration and future maintenance seemed too high, so I abandoned that path. Later, I thought about enhancing Typecho's own search directly. With help from an expert colleague at my company, I completed this improvement.
Since the improvement is implemented by directly modifying Typecho's source files, it might be overwritten if Typecho is upgraded. Therefore, I am making a note of it here as a memo.
Searching on GitHub, I found that Typecho's search function is implemented in var/Widget/Archive.php, specifically around lines 1185–1192:
if (!$hasPushed) {
    $searchQuery = '%' . str_replace(' ', '%', $keywords) . '%';

    /** Search cannot enter protected archives */
    $select->where('table.contents.password IS NULL')
        ->where('table.contents.title LIKE ? OR table.contents.text LIKE ?', $searchQuery, $searchQuery)
        ->where('table.contents.type = ?', 'post');
}
Evidently, search results are returned by matching keywords in SQL with LIKE, where % is the SQL wildcard that matches any sequence of characters. The code also shows that any spaces in the query are replaced with wildcards, which makes the search a bit more flexible.
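To make the behavior concrete, here is a minimal sketch that mirrors the pattern-building line in Python and runs it against an in-memory SQLite table. The table layout, the sample post, and the helper names like_pattern and search are assumptions made for illustration; SQLite merely stands in for the database behind Typecho.

```python
import sqlite3


def like_pattern(keywords: str) -> str:
    """Mirror the PHP line: wrap the query in % and turn each space into %."""
    return "%" + keywords.replace(" ", "%") + "%"


# Hypothetical stand-in for Typecho's contents table, holding one sample post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contents (title TEXT, text TEXT)")
conn.execute(
    "INSERT INTO contents VALUES (?, ?)",
    ("2018年天象预告", "今年值得关注的天象有月全食和流星雨"),
)


def search(keywords: str) -> list:
    """Return the titles whose title or text matches the LIKE pattern."""
    pattern = like_pattern(keywords)
    rows = conn.execute(
        "SELECT title FROM contents WHERE title LIKE ? OR text LIKE ?",
        (pattern, pattern),
    )
    return [title for (title,) in rows]


hits_without_space = search("2018天象")   # exact substring is absent
hits_with_space = search("2018 天象")     # the space becomes a wildcard
hits_reversed = search("天象 2018")       # wildcards still fix the word order
```

In this sketch only the query with a space finds the post. The reversed query finds nothing, because a pattern such as %天象%2018% still requires the words to appear in that order within a single field.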
A natural idea, then, is to segment the query into words ourselves, whether or not it contains spaces, and join the segments with wildcards. This allows more flexible searching even when no spaces are provided, and it was indeed the first approach I tried. The problem is that even after segmentation, every word must still match for a result to be returned. If even one word has never appeared in the blog, nothing is found. To do better, each word needs to be treated as a candidate rather than a requirement.
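The difference between "requirement" and "candidate" can be sketched in a few lines of plain Python. The sample document and the segmentation of "新词算法" into two words are assumptions for illustration, and the regular expression only emulates what a wildcard-joined LIKE pattern would do.

```python
import re


def matches_all_in_order(words, document):
    """Emulate LIKE '%w1%w2%...%': every word must occur, in the given order."""
    pattern = ".*".join(re.escape(w) for w in words)
    return re.search(pattern, document, flags=re.DOTALL) is not None


def candidate_hits(words, document):
    """Count how many of the words occur at all; each word is optional."""
    return sum(1 for w in words if w in document)


document = "新词发现的信息熵方法与实现"   # hypothetical post text
words = ["新词", "算法"]                  # assumed segmentation of "新词算法"

strict = matches_all_in_order(words, document)  # fails: "算法" never appears
loose = candidate_hits(words, document)         # still credits "新词"
```

With the strict rule the post is dropped because one word is missing, whereas the candidate count stays positive, so the post can still be returned and ranked by how many words it contains.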
To achieve the aforementioned goal, I wrote an HTTP interface in Python and placed it on the server. This HTTP interface is responsible for word segmentation and generating the SQL statement. I then replaced $keywords = $this->request->filter('url', 'search')->keywords; with $keywords = $this->request->keywords; and rewrote the code mentioned above to:
if (!$hasPushed) {
    $url = 'http://127.0.0.1:7777/token?text=' . $keywords;
    $url = str_replace(' ', '%20', $url);
    $searchQuery = file_get_contents($url);

    /** Use simple exact match if the interface fails */
    if (!$searchQuery) {
        $searchQuery = 'SIGN(INSTR(table.contents.title, "' . $keywords . '"))';
        $searchQuery = $searchQuery . ' + SIGN(INSTR(table.contents.text, "' . $keywords . '"))';
    }

    /** Search cannot enter protected archives */
    $select->where('table.contents.password IS NULL')
        ->where($searchQuery . ' > 0')
        ->where('table.contents.type = ?', 'post')
        ->order($searchQuery, Typecho_Db::SORT_DESC);
}
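One detail worth noting: the PHP code above only replaces spaces, so a query containing characters such as & or # may be truncated or misread before it reaches the interface. The sketch below shows, in Python, what full percent-encoding of the text parameter looks like; the helper name token_url is an assumption for illustration. On the PHP side, rawurlencode is the function that produces this kind of encoding.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE = "http://127.0.0.1:7777/token"


def token_url(keywords: str) -> str:
    """Build the request URL with the whole query percent-encoded."""
    # quote_via=quote encodes spaces as %20 rather than +.
    return BASE + "?" + urlencode({"text": keywords}, quote_via=quote)


url = token_url("a b&c")

# Decoding the query string recovers the original keywords unchanged.
round_trip = parse_qs(urlsplit(url).query)["text"][0]
```

Here the ampersand becomes %26, so it stays part of the text value instead of starting a new query parameter.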
The interface at http://127.0.0.1:7777/token?text= is a Python program:
#! -*- coding:utf-8 -*-

import bottle
import jieba

jieba.initialize()

def convert(s):
    ws = jieba.cut(s)
    search = []
    for i in ws:
        search.append('2*SIGN(INSTR(table.contents.title, "%s"))'%i)
        search.append('SIGN(INSTR(table.contents.text, "%s"))'%i)
    return '(%s)'%(' + '.join(search))

@bottle.route('/token', method='GET')
def token_home():
    text = bottle.request.GET.get('text')
    if not text:
        text = ''
    return convert(text)

if __name__ == '__main__':
    bottle.run(host='0.0.0.0', port=7777, server='gunicorn')
This interface returns the scoring part of the SQL statement. The algorithm is: first perform word segmentation; each word that appears in an article's title adds 2 points, and each word that appears in the article's content adds 1 point. The points are summed into a total score. INSTR returns the position of the first occurrence of a substring (0 if it is absent), and SIGN turns that position into 0 or 1, so each word counts at most once per field. I highly recommend bottle, a lightweight library that makes writing HTTP interfaces very convenient.
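The script above inserts each word into the SQL text as is, so a word containing a double quote would break the statement, and whitespace-only tokens would be scored like real words. Below is a hedged sketch of a more defensive builder. The names sql_quote and score_sql are hypothetical, the word list stands in for the output of jieba.cut, and the escaping assumes MySQL's default SQL mode; it is not a substitute for proper parameter binding.

```python
def sql_quote(word: str) -> str:
    """Quote a word as a double-quoted string literal.

    Assumes MySQL's default SQL mode: backslash is an escape character,
    and a double quote inside the literal is written as two double quotes.
    """
    escaped = word.replace("\\", "\\\\").replace('"', '""')
    return '"%s"' % escaped


def score_sql(words) -> str:
    """Build the score expression: 2 points per title hit, 1 per text hit."""
    terms = []
    for w in words:
        if not w.strip():
            continue  # skip whitespace-only tokens
        lit = sql_quote(w)
        terms.append("2*SIGN(INSTR(table.contents.title, %s))" % lit)
        terms.append("SIGN(INSTR(table.contents.text, %s))" % lit)
    # An empty result lets the PHP side fall back to its exact-match branch.
    return "(%s)" % " + ".join(terms) if terms else ""


expr = score_sql(["新词", " ", "算法"])  # assumed segmentation of "新词 算法"
```

For this two-word query each word contributes between 0 and 3 points, so an article scores between 0 and 6, and only articles with a score above 0 are returned.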
One more modification is required. The PHP code now calls order($searchQuery, Typecho_Db::SORT_DESC) to sort by score in descending order, but this does not take effect on its own, because Typecho by default sorts everything by time in descending order. Therefore, we must also modify lines 1396–1397 of the same file, changing the original:
$select->order('table.contents.created', Typecho_Db::SORT_DESC)
    ->page($this->_currentPage, $this->parameter->pageSize);
to:
if (strpos($select, 'INSTR') === false) {
    $select->page($this->_currentPage, $this->parameter->pageSize)
        ->order('table.contents.created', Typecho_Db::SORT_DESC);
} else {
    $select->page($this->_currentPage, $this->parameter->pageSize);
}
The general idea is to check if it is a search statement. If it is, then do not sort by time; if it isn't, then sort by time. Simply removing the time-based sorting is not viable because this line also controls the homepage output, and the homepage must be sorted by time.
Why use this hybrid Python and PHP solution instead of writing it purely in PHP? It's true that a pure PHP version is possible—there is indeed a PHP version of Jieba segmentation—but the most important reason is that I don't know PHP! Furthermore, the PHP version of Jieba requires additional configuration, which is a bit troublesome. Using Python is much simpler for me; if I need to make improvements, I can just modify the Python script.
Finally, some readers might worry that such a "brute-force" solution has efficiency issues. Indeed, INSTR cannot use an index, so every search scans the title and text of every post, once per word. With hundreds of thousands of articles, this approach would certainly have serious efficiency problems. However, for a blog with only a few hundred articles, this issue does not need to be considered.
At last, I can search more freely. Further suggestions are welcome.