BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//pretalx.com//pydata-london-2026//talk//MMS9WY
BEGIN:VTIMEZONE
TZID:GMT
BEGIN:STANDARD
DTSTART:20001029T030000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10
TZNAME:GMT
TZOFFSETFROM:+0100
TZOFFSETTO:+0000
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000326T020000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=3
TZNAME:BST
TZOFFSETFROM:+0000
TZOFFSETTO:+0100
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-pydata-london-2026-MMS9WY@pretalx.com
DTSTART;TZID=GMT:20260607T110000
DTEND;TZID=GMT:20260607T114500
DESCRIPTION:We rely on dashboards to tell us whether our RAG system is
  working. But most standard metrics (cosine similarity\, BLEU\, and
  even BERTScore) are fundamentally broken for measuring factual
  correctness. They measure text overlap or semantic similarity\, not
  truth.\n\nThis means you can have a "90% Accurate" system on paper
  that hallucinates dangerous misinformation in production. This talk
  dismantles the current state of RAG evaluation. We will look at why
  "Golden Datasets" are often contaminated\, why "LLM-as-a-Judge" is
  biased towards its own output\, and how to build a robust\,
  adversarial evaluation pipeline that actually catches failures before
  your users do.
DTSTAMP:20260602T223329Z
LOCATION:Grand Hall 1
SUMMARY:The Silent Crash: Why Your RAG Evaluation Metrics Are Lying to You 
 - Hitendri Bomble\, Arghyadeep Sarkar
URL:https://pretalx.com/pydata-london-2026/talk/MMS9WY/
END:VEVENT
END:VCALENDAR
