2024-04-22 – A1
The scikit-learn website currently employs an "exact" search engine based on the Sphinx Python package, but it has limitations: it is not robust to spelling mistakes and cannot handle natural-language queries. To address these shortcomings, we experimented with large language models (LLMs) and, due to resource constraints, opted for a retrieval augmented generation (RAG) system.
This talk introduces our experimental RAG system for querying the scikit-learn documentation. We focus on an open-source software stack and open-weight models. The talk presents the different stages of the RAG pipeline. We describe the documentation scraping strategies we designed around numpydoc and sphinx-gallery, which are used to build the vector indices for lexical and semantic search. We compare our RAG approach with an LLM-only approach to demonstrate the advantage of providing context. The source code for this experiment is available on GitHub: https://github.com/glemaitre/sklearn-ragger-duck.
Finally, we discuss the gains and challenges of integrating such a system into an open-source project, including hosting and cost considerations, and compare it with alternative approaches.
Currently, the scikit-learn website provides an "exact" search engine based on the tools provided by the Sphinx Python package (i.e., https://www.sphinx-doc.org/). The current search engine is implemented in JavaScript and runs locally using an index built when generating the documentation. This solution has the advantage of being lightweight and requires no server to handle queries. However, it can only treat simple queries: because the search is "exact," it is not robust to spelling mistakes, and it is designed for keyword-based rather than natural-language searches.
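To illustrate this limitation, here is a toy sketch (standard library only; the index entries are hypothetical, not the actual Sphinx index) contrasting an exact keyword lookup with an approximate match that tolerates typos:

```python
import difflib

# Hypothetical search index mapping keywords to documentation pages.
index = {
    "randomforestclassifier": "ensemble.html#random-forests",
    "logisticregression": "linear_model.html#logistic-regression",
    "standardscaler": "preprocessing.html#standardization",
}

def exact_search(query):
    """Exact lookup: any spelling mistake yields no result."""
    return index.get(query.lower())

def fuzzy_search(query, cutoff=0.7):
    """Approximate lookup: tolerant to small typos."""
    matches = difflib.get_close_matches(query.lower(), index, n=1, cutoff=cutoff)
    return index[matches[0]] if matches else None

exact_search("RandomForestClasifier")  # typo: returns None
fuzzy_search("RandomForestClasifier")  # typo: still finds the page
```

Even fuzzy string matching, however, only fixes spelling; it does not help with queries phrased in natural language, which is where the LLM-based approach comes in.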
As large language models (LLMs) are becoming more popular, we have been interested in experimenting with this technology, knowing that it could address some of the previously stated limitations. As an open-source project, we have limited compute resources and limited available datasets; therefore, we discarded the option of fine-tuning an LLM and leaned towards retrieval augmented generation (RAG) systems.
This talk presents an experimental RAG system developed to query the scikit-learn documentation. As constraints, we restrict ourselves to an open-source software stack and open-weight models to build our system. The talk is structured as follows:
First, we provide some background on RAG systems and the pipeline required to implement one.
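In essence, a RAG pipeline retrieves documentation chunks relevant to the query, augments the prompt with them, and only then calls the LLM. A minimal sketch (the corpus, the word-overlap retriever, and the `generate` stub are all illustrative, not the actual implementation):

```python
# Toy corpus of documentation chunks (illustrative content).
CORPUS = [
    "RandomForestClassifier fits a number of decision trees on sub-samples.",
    "StandardScaler standardizes features by removing the mean.",
    "LogisticRegression implements regularized logistic regression.",
]

def retrieve(query, corpus, k=2):
    """Rank chunks by word overlap with the query (toy lexical retriever)."""
    query_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, context_chunks):
    """Augment the user query with the retrieved context."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def generate(prompt):
    """Placeholder for a call to an open-weight LLM."""
    return "<LLM answer grounded in the retrieved context>"

query = "how does RandomForestClassifier work"
chunks = retrieve(query, CORPUS)
answer = generate(build_prompt(query, chunks))
```

A real system replaces the toy retriever with lexical and semantic indices and the stub with an actual model call, but the data flow is the same.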
Then, we go into the details of the different stages of the RAG pipeline. We provide some insights into the documentation scraping strategies that we developed by leveraging the numpydoc and sphinx-gallery parsers. Then, we discuss the solutions that we tested to perform lexical and semantic searches. Finally, we explain how the retrieved context can be fed to the LLM to help generate an answer to the user query. We provide a small demo comparing queries performed on an LLM-only system and on the developed RAG system. All the code for the experiment is hosted in the following GitHub repository: https://github.com/glemaitre/sklearn-ragger-duck.
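To give a flavor of the scraping stage: NumPy-style docstrings have a regular structure (headers underlined with dashes) that makes them natural units to chunk before indexing. The project leverages the numpydoc parser for this; the following is only a rough standard-library approximation of the chunking idea, not the actual scraper:

```python
import re

def split_numpydoc_sections(docstring):
    """Split a NumPy-style docstring into {section: text} chunks by
    detecting headers underlined with dashes (the numpydoc convention)."""
    lines = docstring.splitlines()
    sections, current, buf = {}, "Summary", []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        nxt = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if line and re.fullmatch(r"-{3,}", nxt):
            # A new section header: flush the previous section.
            sections[current] = "\n".join(buf).strip()
            current, buf = line, []
            i += 2  # skip the header and its underline
            continue
        buf.append(lines[i])
        i += 1
    sections[current] = "\n".join(buf).strip()
    return sections

doc = """Compute the mean.

Parameters
----------
x : array
    Input values.

Returns
-------
float
    The mean of ``x``.
"""
chunks = split_numpydoc_sections(doc)
# chunks now maps "Summary", "Parameters", and "Returns" to their text.
```

Each section (optionally merged with the object's qualified name) can then be embedded and stored in the vector index, so a query about parameters retrieves the relevant parameter description rather than a whole page.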
Finally, we put into perspective the gains and pains of such a RAG system when it comes to integrating it into an open-source project. Notably, we question the hosting and cost of such systems and compare them with other approaches that could tackle some of the original issues.
Intermediate
Expected audience expertise: Python: Intermediate
Abstract as a tweet (X) or toot (Mastodon): A Retrieval Augmented Generation system to query the scikit-learn documentation
Public link to supporting material, e.g. videos, Github, etc.: I have a PhD in computer science and have been a scikit-learn and imbalanced-learn core developer since 2017. I am currently an open-source engineer helping with the maintenance of these tools.