2026-06-07, Grand Hall 1
We rely on dashboards to tell us if our RAG system is working. But most standard metrics (Cosine Similarity, BLEU, and even BERTScore) are fundamentally broken for measuring factual correctness. They measure text overlap or semantic similarity, not truth.
This means you can have a "90% Accurate" system on paper that hallucinates dangerous misinformation in production. This talk dismantles the current state of RAG evaluation. We will look at why "Golden Datasets" are often contaminated, why "LLM-as-a-Judge" is biased towards its own output, and how to build a robust, adversarial evaluation pipeline that actually catches failures before your users do.
Picture this: You’ve just finished your RAG pipeline. The test dashboard is all green: Context Recall is 85% and Answer Relevance is 92%. You deploy with confidence. Ten minutes later, a user asks a simple question, and the bot confidently gives the wrong answer.
Why did the metrics pass? Because similarity is not correctness. To a vector database, "The treatment is safe" and "The treatment is not safe" look nearly identical: they share almost all of their words and the same sentence structure. But logically, they are opposites. Standard metrics like Cosine Similarity or BLEU often completely miss these critical negations.
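The negation example can be checked in a few lines of Python. This is a minimal sketch, not a real evaluation setup: it assumes bag-of-words count vectors as a stand-in for learned embeddings, and clipped unigram precision as a stand-in for full BLEU. The helper names (`tokenize`, `cosine_similarity`, `unigram_precision`) are hypothetical and do not come from any library.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, drop periods, split on whitespace (deliberately simplistic).
    return text.lower().replace(".", "").split()


def cosine_similarity(a, b):
    # Cosine similarity between word-count vectors.
    # Assumption: counts stand in for a learned embedding model.
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b)


def unigram_precision(candidate, reference):
    # Fraction of candidate words that also appear in the reference,
    # clipped by reference counts (the 1-gram ingredient of BLEU).
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    matched = sum(min(count, ref[w]) for w, count in cand.items())
    return matched / sum(cand.values())


reference = "The treatment is not safe"
answer = "The treatment is safe"

print(round(cosine_similarity(answer, reference), 3))  # 0.894
print(unigram_precision(answer, reference))            # 1.0
```

Both scores are high even though the answer states the opposite of the reference. Real embedding models and full BLEU will produce different numbers, but the failure mode is the same: neither score has any notion of negation.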
In this talk, we are going to stop relying on "vibe checks" and start treating Evaluation as a software testing problem. We’ll look at why traditional NLP metrics are useless for RAG and move toward the new standard: LLM-as-a-Judge. We will discuss the messy reality of using GPT-4 to grade Llama-3, how to catch "Self-Preference Bias" (where models favor their own writing style), and how to do all of this without bankrupting your API budget.
Outline
- Real-world examples where high metrics hid major failures, and why "Finding the doc" (Retrieval) is different from "Answering the question" (Generation).
- Why Your Metrics Are Broken: Why Cosine Similarity is good for search but bad for truth, and why BLEU scores punish correct answers just for using synonyms.
- Using models (like G-Eval) to grade logic and tone, and solving the "Judge Paradox" by swapping options to remove Position Bias.
- Building a "Hard" Test Set: How to stop testing on easy questions and generate adversarial "Trick Questions" that specifically target your retrieval gaps.
- Key Takeaways: A practical strategy for using metrics, plus a look at tools like Ragas and DeepEval.
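The option-swapping idea from the outline can be sketched as follows. This is an illustration under stated assumptions, not the speakers' implementation: a judge is modeled as any callable that takes a question and two answers and returns `"first"` or `"second"`, and the two toy judges below are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable

# Assumed judge interface: (question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def judge_with_swap(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    # Ask the judge twice, once per ordering, and accept a verdict
    # only if the same answer wins regardless of its position.
    original = judge(question, answer_a, answer_b)
    swapped = judge(question, answer_b, answer_a)
    winner_original = "A" if original == "first" else "B"
    winner_swapped = "B" if swapped == "first" else "A"
    if winner_original == winner_swapped:
        return winner_original
    return "inconsistent"


def always_first_judge(question: str, first: str, second: str) -> str:
    # Toy judge with pure position bias: always picks whatever it sees first.
    return "first"


def longer_answer_judge(question: str, first: str, second: str) -> str:
    # Toy judge that ignores position and prefers the longer answer.
    return "first" if len(first) >= len(second) else "second"


question = "Is the treatment safe?"
a, b = "No.", "No, the trial reported serious side effects."

print(judge_with_swap(always_first_judge, question, a, b))   # inconsistent
print(judge_with_swap(longer_answer_judge, question, a, b))  # B
```

Verdicts that flip when the order flips are flagged as `inconsistent` instead of being counted as wins. The check doubles the number of judge calls, which matters for the API budget.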
Hitendri Bomble is a Senior Data Scientist at Red Hat, where she builds Generative AI solutions to solve complex business problems. She specializes in working with Large Language Models (LLMs) to create tools that make everyday work more efficient. Deeply rooted in the open-source community, Hitendri focuses on using the latest AI innovations to automate tasks and bring fresh ideas to her team.
Arghyadeep Sarkar is a Senior Data Scientist at Red Hat with ~8 years of experience in data science and artificial intelligence. His career has evolved from traditional machine learning to architecting large-scale Generative AI and LLM-based production systems.
He built strong foundations in statistical modeling, ML pipelines, and applied AI, later specializing in deep learning, NLP, transformers, and Generative AI. He has designed and deployed LLM agents, RAG-based systems, and enterprise conversational platforms, covering the full lifecycle from training and fine-tuning to scalable deployment.
Current Focus
- Building reliable agentic AI systems
- Improving retrieval grounding and RAG quality
- Deploying LLMs and SLMs in production
- Delivering scalable, cost-efficient enterprise AI solutions
He brings a system-first engineering mindset, translating cutting-edge AI research into robust real-world products.