PyConDE & PyData Berlin 2024

Put your RAG to the test: Component-per-component evaluation of our LLM-powered airplane manufacturing assistant
04-22, 16:10–16:55 (Europe/Berlin), A1

Your RAG-powered LLM application might look pretty convincing at first glance, but how do you really know if it’s any good? And how do you justify the design choices you make? In this talk, you will learn about the RAG evaluation concept we developed at Airbus for evaluating the components of our digital engineering assistant, its implementation with open source tools paired with Google Vertex AI, and what we learnt in the process.


Nowadays, the Retrieval Augmented Generation (RAG) architecture has become quite the standard approach for building high-quality document search products or personal assistant applications. Prototyping a RAG application might yield convincing results from the very first stages of development, but how do you know if it’s really any good when you move your application from prototype to production? And how do you justify the design choices you make? For example, do you know whether long-context models would perform better than short-context models with chunking for the long-form documents you have at hand? Or, what difference does it make if you keep your different types of documents in one index or in separate ones? Or, is few-shot learning really worth it for your use case, given that adding examples can increase the cost dramatically compared to zero-shot learning? And of course, how do you know there isn’t a better prompt out there for making the LLM do exactly what you expect it to?
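To make the chunking question concrete, here is a minimal sketch (illustrative only, not code from the talk; the chunk sizes, overlaps and placeholder document are invented) of preparing two competing chunking configurations with LangChain for a side-by-side comparison:

# Hypothetical sketch: two chunking configurations to evaluate against each other.
# Parameters and the sample document are made up, not actual Airbus settings.
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_docs = ["<full text of a long-form assembly manual>"]  # placeholder input

configs = {
    "small_chunks": RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50),
    "large_chunks": RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200),
}

for name, splitter in configs.items():
    chunks = splitter.create_documents(long_docs)
    # Each variant would be indexed separately and scored with the same evaluation set.
    print(f"{name}: {len(chunks)} chunks")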

At Airbus, we went through this thought process during the development of a RAG-based assistant for creation of assembly manuals - documents which help our colleagues in Manufacturing navigate through the airplane parts construction procedures. For answering these and other questions, we produced an evaluation concept for our Generative AI applications, which relies on different methods and metrics for RAG evaluation end-to-end and testing each of its components separately. In this talk, we will present our evaluation concept, how we implemented it with tools like LangChain and Ragas, what metrics we use and how we conduct our experiments with the help of Google Vertex AI Pipelines.
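As a rough illustration of what such a component-wise evaluation can look like in code (a hedged sketch assuming the Ragas 0.1 API; the sample record and choice of metrics are invented, not taken from the talk):

# Hypothetical sketch of a Ragas evaluation run; the record below is made up.
# evaluate() calls a judge LLM under the hood (OpenAI by default, configurable via llm=).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,   # generation: is the answer relevant to the question?
    context_precision,  # retrieval: how relevant are the retrieved chunks?
    context_recall,     # retrieval: was the needed context actually retrieved?
    faithfulness,       # generation: is the answer grounded in the retrieved context?
)

eval_set = Dataset.from_dict({
    "question": ["How is the wing spar attached to the fuselage?"],
    "answer": ["The spar is bolted to fittings on the center fuselage."],
    "contexts": [["Section 4.2: The spar attaches to the fuselage via bolted fittings."]],
    "ground_truth": ["The spar is bolted to center-fuselage fittings."],
})

result = evaluate(eval_set, metrics=[faithfulness, answer_relevancy,
                                     context_precision, context_recall])
print(result)  # per-metric scores, separating retriever issues from generator issues

Scoring retrieval and generation with separate metrics is what lets you attribute a quality change to a specific component rather than to the pipeline as a whole. To hint at the experimentation side, a run like this could be wrapped as a pipeline step (again a hypothetical sketch; the component and pipeline names are invented, using the KFP v2 SDK that Vertex AI Pipelines accepts):

# Hypothetical sketch of wrapping an evaluation step as a Vertex AI (KFP v2) pipeline;
# names, parameters and the component body are placeholders for illustration.
from kfp import compiler, dsl

@dsl.component(packages_to_install=["ragas", "datasets"])
def evaluate_rag(dataset_uri: str) -> float:
    # Placeholder body: would load the eval set from dataset_uri and run ragas.evaluate().
    return 0.0

@dsl.pipeline(name="rag-eval-experiment")
def rag_eval_pipeline(dataset_uri: str):
    evaluate_rag(dataset_uri=dataset_uri)

compiler.Compiler().compile(rag_eval_pipeline, "rag_eval_pipeline.json")
# The compiled spec can then be submitted with google.cloud.aiplatform.PipelineJob.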


Expected audience expertise: Domain

Intermediate

Expected audience expertise: Python

Novice

Abstract as a tweet (X) or toot (Mastodon)

This talk discusses component-wise evaluation of RAG-based applications, using the example of the airplane manufacturing assistant developed at Airbus with open source Python libraries paired with Google Vertex AI.

I am a Data Scientist at Airbus, where I am part of the Digital team, building AI products which empower engineering, manufacturing, sales and other business activities of the company. I enjoy diving deep into natural language processing and am passionate about MLOps, good coding practices and deploying AI applications in the cloud. Apart from that, I teach Python, and in my free time I enjoy hiking and learning new languages.