Have you ever struggled with a poorly documented software project? What about a well-documented project where you can’t find the right section inside the docs? The Read the Docs core team realized the importance of good search for documentation and got me to take on the challenge as a Google Summer of Code student. The main goal of my GSoC project was to refactor the search code and upgrade the backend search engine, as well as to add more features to our search functionality, such as exact match search, case insensitive search, search as you type, suggestions and more.
Google Summer of Code
Google Summer of Code is a global program where students work with an open source organization on a 3 month programming project. The core team of Read the Docs proposed some Project Ideas, one of which was Refactor & improve our search code. I (Safwan Rahman) was keen to get my hands dirty with Elasticsearch, so I seized the opportunity by applying for this project, and I was accepted.
I worked full time from April to August to upgrade the whole codebase to be compatible with Elasticsearch 6.x and also implemented various features like:
Together with this, we are planning more features like the following:
All of my search related work can be seen in the Search Project Board.
Search is a vital part of any documentation hosting platform, because it lets people find the information they need, and Read the Docs is no exception. Because the core team is small, the search functionality of Read the Docs has lagged behind for quite a while. The search code was initially contributed by Rob Hudson as a volunteer back in 2013 and was then improved by other contributors. The search infrastructure was outdated: Read the Docs was using Elasticsearch 1.3.x, which had already reached its End of Life in 2016. Upgrading the search infrastructure was therefore badly needed.
Built in Search vs Read the Docs Search
Both Sphinx and MkDocs already have built-in search functionality, but its features are very limited. At Read the Docs, we have felt these limitations, and therefore we index the documentation in our own Elasticsearch index so that we can provide a better search experience, including:
Search across multiple projects
Advanced query syntax
Search inside subprojects
Improved search result order
Public Search API (Documentation pending)
In the 4 months of full time work, I have implemented these features along with many bug fixes. Some of the major features are as follows:
Exact Matching Search: Exact matching is one of the most highly requested features for Read the Docs. You can now search for an exact phrase in the documentation by enclosing your query in quotation marks (""). So if you search for "Here is foo" (with the quotation marks), you will get all the documentation where the full phrase Here is foo exists.
Simple Query String Syntax support: You can now use Simple Query String Syntax for searching. For example, if you search with Mozilla +-Firefox, you will get documentation where the word Mozilla is present but the word Firefox is not. For more information, you can look at the Simple Query String Syntax documentation.
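As a rough illustration of how queries like these could be passed to Elasticsearch, here is a minimal sketch that builds a request body for the `simple_query_string` query. The helper name `build_search_body` and the field names `title` and `content` are assumptions made for this example; they are not taken from the Read the Docs codebase.

```python
import json


def build_search_body(user_query, fields=("title", "content")):
    """Build an Elasticsearch request body using simple_query_string.

    The field names are hypothetical placeholders, not the real mapping.
    """
    return {
        "query": {
            "simple_query_string": {
                # The user's text is passed through unchanged, so operators
                # such as quotes, + and - are interpreted by Elasticsearch.
                "query": user_query,
                "fields": list(fields),
            }
        }
    }


# Exact phrase: the double quotes are part of the query string itself.
phrase_body = build_search_body('"Here is foo"')

# Must contain Mozilla and must not contain Firefox.
exclude_body = build_search_body("Mozilla +-Firefox")

print(json.dumps(exclude_body, indent=2))
```

The point of the sketch is that the application does not need to parse the operators itself; it only forwards the query string and lets the search engine apply the syntax.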
Case Insensitive Search: The search is now case insensitive. So if you search for Foo, you will get all the documentation that contains the word in any capitalization.
Improved Result order: The result order is improved dramatically. If you search with Foo Bar, you will first see the documentation that contains both Foo and Bar, and then other documentation that contains only one of the two words.
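The behavior described for case insensitive matching and result order can be pictured with a toy model. The sketch below is not Elasticsearch's scoring and not the Read the Docs implementation; it is a simplified stand-in, with a hypothetical `rank` function and made-up page names, that orders pages by how many distinct query words they contain while ignoring case.

```python
def rank(documents, query):
    """Order documents by how many distinct query words they contain.

    Matching ignores case. Documents that match no word are dropped.
    """
    terms = {word.lower() for word in query.split()}
    scored = []
    for name, text in documents.items():
        words = {word.lower() for word in text.split()}
        score = len(terms & words)
        if score:
            scored.append((score, name))
    # Highest score first; ties are broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


docs = {
    "a.html": "only foo appears here",
    "b.html": "Foo and BAR both appear",
    "c.html": "nothing relevant",
    "d.html": "just bar",
}
ranked = rank(docs, "Foo Bar")
print(ranked)  # ['b.html', 'a.html', 'd.html']
```

The page containing both words comes first, followed by the pages containing only one of them, and the capitalization of the query or the text does not matter.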
Auto Removing from Search Index: In the past, if a page was removed from the documentation, it still showed up in search. From now on, the page will be removed automatically from the search index as soon as it is no longer in the documentation.
Zero Downtime Indexing: As search is an important part of documentation, there will be no downtime while we reindex our Elasticsearch Index. So the search will be much more reliable than before.
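One common way to reindex without downtime in Elasticsearch is to build a fresh index under a new name and then repoint an alias to it in a single request to the `_aliases` endpoint. The post does not say which approach Read the Docs uses, so the sketch below only illustrates that general technique. The helper names and the `pages` alias are assumptions made for this example.

```python
from datetime import datetime, timezone


def new_index_name(alias, now=None):
    """Return a timestamped index name such as pages_20180801120000."""
    now = now or datetime.now(timezone.utc)
    return f"{alias}_{now:%Y%m%d%H%M%S}"


def alias_swap_actions(alias, old_index, new_index):
    """Build the body for POST /_aliases that repoints the alias.

    Both actions are sent in one request, so searches that go through
    the alias always see either the old index or the new one.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


old = new_index_name("pages", datetime(2018, 7, 1, tzinfo=timezone.utc))
new = new_index_name("pages", datetime(2018, 8, 1, 12, 0, 0, tzinfo=timezone.utc))
body = alias_swap_actions("pages", old, new)
```

In this scheme the application always searches the alias, the new index is filled in the background, and the old index can be deleted once the alias has been swapped.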
Code quality is very important in the development world, especially in open source. As I have rewritten the search functionality from scratch, the code quality has improved in many ways, including test coverage and documentation. So it is easy for any contributor to start working on the search functionality.
As Read the Docs is an open source project backed by a small team of developers, most of them are busy just keeping things up and running. It is therefore quite hard for them to find time to implement new features. If you know a bit of Django or Python and Elasticsearch, you can contribute to the search functionality of Read the Docs. If you need any support to start contributing, you can get in touch with me or any member of the Read the Docs team. You can find all of us in the #readthedocs freenode IRC channel or the readthedocs gitter channel. I am safwan on IRC and @safwanrahman on gitter.
To conclude, I must say that the search improvement in Read the Docs was very necessary, and I’m glad I could improve it in such a short amount of time. There are countless ways it can be improved further, and I believe we can compete with major search engines in terms of documentation searching. Because I was only working for three months, a number of compelling features were left out, such as Search as You Type, Autocomplete and Code Search. Moreover, proper documentation is still needed for the search architecture. I have tried to write test cases for most of the scenarios, but because of time constraints, a lot of code is still outside test coverage.
I strongly hope that we will get the remaining work done within a short amount of time. This can be done easily if more contributors donate their time to improving Read the Docs. We don’t need a superhero or a coding guru, just people who understand Python, Django and Elasticsearch and have some time to write some code for us. You are a Superhero to us if you can lend your time and effort to improve Read the Docs.