Evaluating the evaluator: RAG eval libraries under the loop
09-25, 10:30–11:00 (Europe/Paris), Louis Armand 1 - Est

Retrieval-augmented generation (RAG) has become a key application for large language models (LLMs), enhancing their responses with information from external databases. However, RAG systems are prone to errors, and their complexity has made evaluation a critical and challenging area. Various libraries (like RAGAS and TruLens) have introduced evaluation tools and metrics for RAG systems, but these evaluations involve using one LLM to assess another, raising questions about their reliability. Our study examines the stability and usefulness of these evaluation methods across different datasets and domains, focusing on the effects of the choice of the evaluation LLM, query reformulation, and dataset characteristics on RAG performance. It also assesses the stability of the metrics across multiple runs of the evaluation and how the metrics correlate with each other. The talk aims to guide users in selecting and interpreting LLM-based evaluations effectively.


Retrieval-augmented generation (RAG) has emerged as a common technique to augment LLMs. It grounds answers in relevant content from databases and documents, offering a cheap and convenient alternative to fine-tuning and re-training.

However, a system that uses RAG can and will make mistakes, and the space of parameters one can tune in a RAG system is increasingly complex. Evaluation has therefore understandably become a hot topic in the community. Many actors (e.g. RAGAS, TruLens, MLflow) have developed libraries and RAG-specific metrics that allow for faster, more principled and more easily scalable evaluation.
This comes with a catch, though: in a way, you're asking an LLM (albeit an admittedly high-grade one) to evaluate another LLM.
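
To make the setup concrete, here is a minimal sketch of what such a library-based evaluation looks like, modeled on the ragas API (roughly version 0.1; column names and metric imports have changed across releases, and the example data is invented):

```python
# Minimal sketch of a library-based RAG evaluation with ragas
# (API as of ragas ~0.1; details vary across releases).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One row per RAG interaction: the question, the retrieved contexts,
# and the generated answer (invented example data).
data = {
    "question": ["What does the warranty cover?"],
    "contexts": [["The warranty covers manufacturing defects for two years."]],
    "answer": ["Manufacturing defects are covered for two years."],
}

# evaluate() delegates scoring to a judge LLM (OpenAI by default, so an
# API key is required) -- which is exactly the catch above.
result = evaluate(Dataset.from_dict(data),
                  metrics=[faithfulness, answer_relevancy])
print(result)  # one score per metric
```

Because the scoring itself is produced by a judge LLM, the scores inherit the very variability one would like to measure.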

To understand the usefulness and stability of RAG evaluation libraries, we have conducted an analysis of RAG evaluation results on two different datasets of corporate and technical documentation, representing different levels of technical difficulty.

Manual evaluation is a time-consuming process, and we only provide a limited comparison of LLM-based results to manual ones. Instead, we dedicate the bulk of the analysis to more systematic ways of measuring how stable the results are. We study, for instance, the impact of the evaluator LLM choice and of query reformulation on retrieval and generation in RAG, and re-run LLMs on both sides (RAG and evaluation) multiple times to assess the bounds of the metrics, as sketched below. We also study the correlation between different evaluation metrics and the impact of the dataset, its domain and its size.
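
As an illustration of that last point, the following hedged sketch (where `run_evaluation` is a hypothetical stand-in for one full pass of an evaluation library, with simulated scores) repeats an evaluation several times and reports per-metric spread and pairwise correlations:

```python
# Hedged sketch: quantify run-to-run stability and inter-metric correlation.
# Scores are simulated here; real ones vary because the judge LLM is stochastic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def run_evaluation(run: int) -> dict:
    # Stand-in for one full pass of e.g. ragas or TruLens over a fixed
    # set of RAG outputs.
    return {
        "faithfulness": float(np.clip(rng.normal(0.85, 0.04), 0.0, 1.0)),
        "answer_relevancy": float(np.clip(rng.normal(0.78, 0.06), 0.0, 1.0)),
    }

runs = pd.DataFrame([run_evaluation(i) for i in range(10)])

print(runs.agg(["mean", "std"]))  # spread across runs bounds each metric
print(runs.corr())                # which metrics move together (possible redundancy)
```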

The goal of this talk is to provide the audience with the insights, caveats and intuition they need when choosing, using and interpreting the results of LLM-based evaluation.

Some prior knowledge in RAG is useful, but not required.

Nour leads the Generative AI technical group at Modus Create. She has a PhD in Machine Learning and has worked on Machine Learning, Data Science and Data Engineering problems in various domains, both inside and outside academia.

Maria's professional goal is to improve the environment by first understanding it in the language of mathematics and then applying that knowledge. After graduating in applied mathematics, Maria began research on two-phase turbulent flows. Her background in mathematical modeling helped her better understand small-scale physical effects and allowed her to model two-phase turbulence more accurately while reducing computational costs.

After completing her PhD, Maria began to work as a data scientist. She was responsible for all stages of data processing, from creating ETL pipelines through modeling to visualizing the results, and she led projects of two to five people. Her inclination towards implementation and design drew her towards functional programming. She integrated Haskell into parts of her data processing pipelines, finding its type system and expressiveness more akin to mathematical language. Maria is also dedicated to maintaining neat, reusable, and well-documented code.

Outside of her technical pursuits, Maria is passionate about promoting diversity in the IT industry and inspiring girls and women to engage in programming. Balancing her career with being a mother of three, she finds limited but cherished time for personal hobbies. When the opportunity arises, Maria enjoys the thrill of motorcycle rides beyond the city limits.
