PyCon DE & PyData 2025

Lessons learned in bringing a RAG chatbot with access to 50k+ diverse documents to production
2025-04-24, Titanium3

Retrieval-Augmented Generation (RAG) chatbots are a key use case of GenAI in organizations, allowing users to conveniently access and query internal company data. A first RAG prototype can often be created in a matter of days. So why are the majority of prototypes still stuck in the pilot stage? [1]

In this talk we share our insights from developing a production-grade chatbot at Merck. Our RAG chatbot for R&D experts accesses over 50,000 documents across numerous SharePoint sites and other sources. We identified three key success factors:
1. Developing a robust data pipeline that syncs documents from source systems and handles enterprise requirements such as replicating user permissions.
2. Establishing a comprehensive evaluation framework with a clear optimization metric.
3. Driving adoption through onboarding training and ongoing user engagement, such as regular office hours.

We think that many of these lessons are broadly applicable to RAG chatbots, making this talk valuable for practitioners aiming to implement GenAI solutions in business contexts.


Building a prototype RAG chatbot with frameworks like LangChain can be straightforward. However, scaling it into a production-grade application introduces complex challenges. In this talk, we share our lessons learned from developing a RAG chatbot designed to assist research and development (R&D) experts.

Our chatbot was developed to effectively handle and provide access to a large collection of unstructured knowledge, consisting of over 50,000 documents stored across more than 20 SharePoint sites and other sources. We faced significant hurdles in:
- Data Pipeline Engineering: Crafting a modular and scalable pipeline capable of periodically syncing documents, handling dynamic user permissions, and efficiently processing large volumes of unstructured data.
- Evaluation Framework Development: Implementing an effective testing strategy without static ground-truth data. We employed automated testing with frameworks like pytest, used LLM-as-a-judge scoring, and integrated tracing to iteratively refine our dataset and maintain high answer quality (see the first sketch after this list).
- RAG Design and Prompting Strategies: Addressing challenges in document chunking, citation integration, reranking retrieved results, and applying permission and PII filters to ensure compliance and accuracy in responses (the second sketch below illustrates the permission filter).
- User Adoption: Driving user adoption through onboarding training and ongoing engagement, such as regular office hours and feedback mechanisms.
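
To make the evaluation approach concrete, here is a minimal sketch of an LLM-as-a-judge test in pytest. The evaluation cases, the `ask_chatbot` helper, and the judge prompt are illustrative assumptions rather than our production setup.

```python
# Minimal sketch: LLM-as-a-judge answer-quality tests in pytest.
# `ask_chatbot` and EVAL_CASES are hypothetical placeholders; the judge uses
# the OpenAI chat completions API to grade each answer against pass criteria.
import pytest
from openai import OpenAI

judge = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical evaluation cases: a question plus the criteria a good answer must meet.
EVAL_CASES = [
    ("What is the shelf life of compound X?",
     "states the shelf life and cites the source document"),
    ("Who approved protocol 123?",
     "names the approver or clearly says the information is not available"),
]


def ask_chatbot(question: str) -> str:
    """Placeholder: call the deployed RAG chatbot under test."""
    return "stub answer"


@pytest.mark.parametrize("question,criteria", EVAL_CASES)
def test_answer_quality(question: str, criteria: str) -> None:
    answer = ask_chatbot(question)
    verdict = judge.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\nAnswer: {answer}\n"
                f"Criteria: {criteria}\n"
                "Does the answer satisfy the criteria? Reply with PASS or FAIL."
            ),
        }],
    )
    assert "PASS" in verdict.choices[0].message.content
```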

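The permission filtering mentioned under RAG design can be sketched as a simple post-retrieval step. The `allowed_groups` metadata field and the `get_user_groups` lookup are assumed, illustrative names for permissions replicated from the source systems during sync, not a description of our actual implementation.

```python
# Sketch: enforce replicated source-system permissions on retrieved chunks
# before they reach the LLM prompt. Field and function names are illustrative.
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str
    allowed_groups: frozenset[str]  # groups copied from the source system during sync


def get_user_groups(user_id: str) -> set[str]:
    """Placeholder: resolve the user's groups, e.g. via the identity provider."""
    return {"rnd-analytics"}


def filter_by_permission(chunks: list[Chunk], user_id: str) -> list[Chunk]:
    """Keep only chunks the requesting user is allowed to see."""
    groups = get_user_groups(user_id)
    return [c for c in chunks if c.allowed_groups & groups]


# Usage: apply the filter between retrieval and generation.
retrieved = [
    Chunk("Stability data ...", "sharepoint://rnd/stability.docx", frozenset({"rnd-analytics"})),
    Chunk("HR policy ...", "sharepoint://hr/policy.docx", frozenset({"hr-internal"})),
]
visible = filter_by_permission(retrieved, user_id="u123")  # only the R&D chunk remains
```
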
We emphasize the importance of applying data science principles to GenAI projects:
- Start Simple and Iterate: Begin with a basic implementation as a baseline and iteratively enhance functionality based on testing and user feedback.
- Test-Driven Development: Identify key test scenarios early and use them to drive development, ensuring that improvements are measurable and aligned with evolving user needs.
- Focus on Key Metrics: Establish clear metrics to optimize against, aiding in making informed decisions throughout the development process.

Main Takeaways for the Audience:
- Understand the critical role of robust, modular data pipelines in handling dynamic and unstructured data sources for LLM applications.
- Learn strategies for developing effective evaluation frameworks in complex domains where traditional ground truth data may be lacking.
- Gain insights into advanced RAG design techniques that enhance chatbot performance and reliability.
- Recognize the substantial data engineering and software development efforts required to transition a prototype to a production-grade LLM solution.

By sharing our experiences, we aim to give attendees practical insights into deploying robust RAG chatbots and into transforming a functional prototype into a reliable, scalable application that fulfills enterprise requirements.


Expected audience expertise (Domain): Advanced

Expected audience expertise (Python): Novice

Bernhard is a Senior Data Scientist at Merck. For more information, you can connect with him on LinkedIn. :-)