PyConDE & PyData Berlin 2024

Guillaume Lemaitre

I have a PhD in computer science and have been a scikit-learn and imbalanced-learn core developer since 2017. I am currently an open-source engineer helping to maintain these tools.


X / Twitter handle

@glemaitre58

GitHub

https://github.com/glemaitre

LinkedIn

https://www.linkedin.com/in/guillaume-lemaitre-b9404939/


Session

04-22
14:35
30min
A Retrieval Augmented Generation system to query the scikit-learn documentation
Guillaume Lemaitre

The scikit-learn website currently uses an "exact" search engine based on the Sphinx Python package, which has two limitations: it cannot handle spelling mistakes or natural-language queries. To address these limitations, we experimented with large language models (LLMs) and, because of resource constraints, opted for a retrieval augmented generation (RAG) system.

This talk introduces our experimental RAG system for querying the scikit-learn documentation. We focus on an open-source software stack and open-weight models. The talk presents the different stages of the RAG pipeline. We describe the documentation scraping strategies we designed, based on numpydoc and sphinx-gallery, whose output is used to build the indices for the lexical and semantic searches. We compare our RAG approach with an LLM-only approach to demonstrate the advantage of providing context. The source code for this experiment is available on GitHub: https://github.com/glemaitre/sklearn-ragger-duck.
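
To make the retrieve-then-prompt idea concrete, here is a minimal sketch of the retrieval and prompt-assembly stages. It is not the code from the repository above: the toy corpus, the function names (`build_index`, `retrieve`, `build_prompt`) and the prompt wording are all assumptions made for illustration. It covers only a lexical (TF-IDF) search written with the standard library; a semantic search would replace the TF-IDF vectors with embeddings from a model, and the final prompt would then be sent to an LLM.

```python
import math
import re
from collections import Counter

# Hypothetical corpus standing in for scraped documentation chunks.
DOCS = {
    "LogisticRegression": "Logistic regression classifier with regularization.",
    "KMeans": "K-means clustering groups samples into clusters.",
    "StandardScaler": "Standardize features by removing the mean and scaling to unit variance.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build a sparse TF-IDF vector per document (the lexical index)."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts.values() for term in c)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = {
        name: {t: tf * idf[t] for t, tf in c.items()} for name, c in counts.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, vectors, idf, top_k=1):
    """Return the names of the top_k documents most similar to the query."""
    q_counts = Counter(tokenize(query))
    # Terms unseen at indexing time carry no weight.
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    ranked = sorted(vectors, key=lambda n: cosine(q_vec, vectors[n]), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_names, docs):
    """Assemble the prompt that would be passed to the LLM (wording is illustrative)."""
    context = "\n".join(f"- {name}: {docs[name]}" for name in context_names)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


if __name__ == "__main__":
    vectors, idf = build_index(DOCS)
    question = "How do I group samples into clusters?"
    hits = retrieve(question, vectors, idf, top_k=1)
    print(build_prompt(question, hits, DOCS))
```

An LLM-only baseline corresponds to sending the question without the `Context:` block, which is the comparison the talk uses to show the benefit of providing context.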

Finally, we discuss the gains and challenges of integrating such a system into an open-source project, including hosting and cost considerations, and compare it with alternative approaches.

PyData: Generative AI
A1