BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//pretalx.com//pyconde-pydata-2026//speaker//9FLCR9
BEGIN:VTIMEZONE
TZID:CET
BEGIN:STANDARD
DTSTART:20001029T030000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10
TZNAME:CET
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000326T020000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=3
TZNAME:CEST
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-pyconde-pydata-2026-MQJVFU@pretalx.com
DTSTART;TZID=CET:20260416T150500
DTEND;TZID=CET:20260416T153500
DESCRIPTION:Testing traditional software is "simple"... same input\, same o
 utput. LLMs? Not so much. Same prompt\, different result every time. So ho
 w do you actually know if your AI product is good?\n\nMost teams struggle 
 with this. Generic metrics like "Helpfulness: 4.2" sound scientific but do
 n't drive real decisions. And when a new model is released\, it's weeks of
  debates instead of data.\n\nThis talk introduces Error Analysis: a metho
 dolo
 gy to discover the concrete failure modes of your AI product and turn them
  into measurable evals. You'll learn how to build a failure taxonomy that 
 enables real prioritization. Which issues are critical? Which are frequent
 ? What should developers fix next\, and how do you measure success?\n\nThe
  payoff: A real quality number for stakeholders. Concrete improvement task
 s for developers. And when a new model drops\, a ship-or-skip decision wit
 hin 24 hours based on actual data.\n\nExpect a meme-powered walkthrough\, 
 real-world examples from production\, and a clear path to implement this
  yourself\, starting with just 20 traces.
DTSTAMP:20260412T142015Z
LOCATION:Platinum [2nd Floor]
SUMMARY:AI Evals Done Right: From Vibes to Confident Decisions - Martin See
 ler
URL:https://pretalx.com/pyconde-pydata-2026/talk/MQJVFU/
END:VEVENT
END:VCALENDAR
