Cool Papers Update: Building a Simple In-site Search System

By 苏剑林 | May 07, 2024

Since "A More Convenient Way to Open Cool Papers: Chrome Redirect Extension", Cool Papers has undergone two major changes. One is the introduction of the venue branch, which is gradually adding the paper collections of various conferences from past years, such as ICLR and ICML. This branch is expanded manually on an ongoing basis, and readers are welcome to suggest more of their favorite conferences. The other change is the subject of this article: the new in-site search function added the day before yesterday.

This article will briefly introduce the new features and summarize the process of building the in-site search system.

Introduction

On the homepage of Cool Papers, you can now see the search entry point:

Cool Papers (2024.05.07)

The features of the search function are as follows:

Only searches the 'title' and 'summary' fields; specifying other fields is not yet supported.

You can choose to search either the arxiv branch or the venue branch; mixed searching across both branches is not supported.

Special characters in the search query (anything other than English letters and digits) will be removed.

Search query words are not automatically stemmed, meaning searching for "images" will not match "image".

On the search results page, the new search can be combined with the original in-page search function.
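The query-handling rules listed above can be illustrated with a minimal sketch. This is not the actual Cool Papers code: the function names normalize_query and matches are hypothetical, and lowercasing, replacing special characters with spaces (rather than deleting them outright), and requiring every term to match are all assumptions made for illustration.

```python
import re

def normalize_query(query: str) -> list[str]:
    """Turn a raw query into lowercase terms made of English letters and digits.

    Assumption: each run of other characters is replaced by a space,
    so "diffusion-based" becomes two terms instead of one fused word.
    """
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", query)
    return cleaned.lower().split()

def matches(terms: list[str], text: str) -> bool:
    """Return True if every term occurs as a whole word in text.

    No stemming is applied, so "images" does not match "image".
    """
    words = set(normalize_query(text))
    return all(term in words for term in terms)

if __name__ == "__main__":
    title = "An Image is Worth 16x16 Words"
    print(normalize_query("Diffusion-based (image) models!"))
    print(matches(normalize_query("image"), title))
    print(matches(normalize_query("images"), title))
```

Because no stemming is applied, a reader who wants both singular and plural forms would need to run two separate searches.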

In general, it is currently a very simple text search function, meant to satisfy the basic needs of some users first. More complex requirements will be addressed in gradual updates. Planned features include specifying fields, searching Kimi FAQ content, sorting by stars, specifying dates/categories (for arxiv), specifying conferences (for venue), and even include/exclude logic (for example, excluding certain keywords) like a standard search engine. These will depend on follow-up feedback from users; there is no fixed schedule yet.

Summary

In fact, users have been asking for in-site search ever since Cool Papers opened to the public at the beginning of the year. It was delayed until now mainly because Cool Papers collects papers day by day, and at the beginning there were too few papers for in-site search to be meaningful. After more than four months of accumulation, Cool Papers has collected over 80,000 arXiv papers, and the venue branch holds over 80,000 conference papers as well. With nearly 170,000 papers in total, the collection is now worth searching.

Once it was decided that search was worth doing, the next question was how to do it. A search system that retrieves article content based on keywords is called "Full-text Search." Such systems are generally built on inverted indexes and BM25 similarity, so the algorithms are mature. In terms of implementation, the backend of Cool Papers is BottlePy, so we had to find a full-text search library available for Python that could be conveniently integrated into Cool Papers.
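To make "inverted index plus BM25" concrete, here is a toy sketch in pure Python. It is not what Cool Papers runs; the class name TinyIndex, the whitespace tokenizer, and the parameter values k1=1.2 and b=0.75 (common defaults) are assumptions chosen for illustration, and a real library handles tokenization, persistence, and performance far better.

```python
import math
from collections import Counter, defaultdict

def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

class TinyIndex:
    """In-memory inverted index with BM25 scoring; expects a non-empty docs dict."""

    def __init__(self, docs: dict[str, str], k1: float = 1.2, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_len: dict[str, int] = {}
        # Inverted index: term -> {doc_id: term frequency in that document}
        self.postings: dict[str, dict[str, int]] = defaultdict(dict)
        for doc_id, text in docs.items():
            tokens = tokenize(text)
            self.doc_len[doc_id] = len(tokens)
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.avg_len = sum(self.doc_len.values()) / len(self.doc_len)

    def search(self, query: str, top_k: int = 10) -> list[tuple[str, float]]:
        """Return up to top_k (doc_id, score) pairs, best score first."""
        n = len(self.doc_len)
        scores: dict[str, float] = defaultdict(float)
        for term in tokenize(query):
            posting = self.postings.get(term, {})
            if not posting:
                continue  # term occurs in no document
            # Rare terms get a larger inverse document frequency weight.
            idf = math.log(1 + (n - len(posting) + 0.5) / (len(posting) + 0.5))
            for doc_id, tf in posting.items():
                # Length normalization: long documents are penalized.
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

if __name__ == "__main__":
    index = TinyIndex({
        "a": "attention is all you need",
        "b": "image generation with diffusion models",
    })
    print(index.search("diffusion image"))
```

The key point is that a query only touches the posting lists of its own terms, so the cost depends on how many documents contain those terms rather than on the size of the whole collection.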

There aren't many choices for the "Python + Full-text Search" combination. The most classic is a library called Whoosh. From a functionality standpoint, it could indeed meet the needs of Cool Papers. However, the problem with Whoosh is that it hasn't been updated since April 2016, leading to concerns about potential hidden issues. Another option is to directly switch to a database with full-text search capabilities, such as MongoDB. If data had been stored in MongoDB from the start, this would undoubtedly be the simplest solution. However, Cool Papers chose to use Python's built-in key-value database, Shelve. Switching to MongoDB now would involve too much engineering work, and for the simple scenarios of Cool Papers, MongoDB's speed wouldn't match Shelve.
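For readers unfamiliar with Shelve, the following is a minimal sketch of how a key-value store of paper records might look with Python's standard shelve module. The helper names, the paper ID, and the record layout (a dict with 'title' and 'summary') are hypothetical and only loosely modeled on the fields mentioned earlier; the real Cool Papers storage layout is not described in this article.

```python
import os
import shelve
import tempfile

def save_paper(db_path: str, paper_id: str, record: dict) -> None:
    """Store one paper record under its ID (values are pickled by shelve)."""
    with shelve.open(db_path) as db:
        db[paper_id] = record

def load_paper(db_path: str, paper_id: str):
    """Fetch one record by ID, or None if the ID is not in the store."""
    with shelve.open(db_path, flag="r") as db:
        return db.get(paper_id)

if __name__ == "__main__":
    # Hypothetical location and record, used only for this demonstration.
    db_path = os.path.join(tempfile.mkdtemp(), "papers")
    save_paper(db_path, "0000.00000", {
        "title": "A Placeholder Title",
        "summary": "A placeholder abstract.",
    })
    print(load_paper(db_path, "0000.00000"))
```

A store like this answers lookups by key only, which is why a separate full-text index is needed to search inside titles and summaries.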

After several fruitless searches, I unexpectedly discovered a small but powerful alternative to Whoosh: tantivy. It is a full-text search library written in Rust, but it provides Python bindings, so it can be used as a Python library. Its API is similar to Whoosh's, and it is still actively maintained. Rust is well known for its efficiency, so tantivy meets everything I want from a full-text search library: speed, a small footprint, and simplicity.

After selecting the full-text search library, the remaining task was the front-end work. In "Happy New Year! Recording the Development Experience of Cool Papers", I mentioned that I am an absolute front-end novice with no artistic sense. Designing a UI is extremely difficult for me; I could only rely on constant searching, copy-pasting, and seeking help from GPT-4 and Kimi. After various bits of "cobbling together" and "patching up," I finally managed to create a usable interface. During the development process, I also optimized the original built-in page search; using the page search now should feel significantly faster.

Final Words

During the May Day holiday, while everyone was looking at KAN (Kolmogorov-Arnold Networks), I took a bit of a break. Instead of reading papers, I added the in-site search function to Cool Papers. I wouldn't say it "emerged after a thousand calls" (was long-awaited), but it is a feature that some users have been requesting for a long time. This article has given a brief introduction to it and summarized the experience of implementing it.

If you found this article helpful, you are welcome to share or donate to this post. Donations are not about profit, but to let me know how many readers truly follow "Scientific Spaces." Of course, if you ignore it, it will not affect your reading. Welcome and thank you once again!

Citation: Su Jianlin (May 2024). https://kexue.fm/archives/10088