PyCon DE & PyData 2026

AI Evals Done Right: From Vibes to Confident Decisions
Platinum [2nd Floor]

Testing traditional software is "simple"... same input, same output. LLMs? Not so much. Same prompt, different result every time. So how do you actually know if your AI product is good?

Most teams struggle with this. Generic metrics like "Helpfulness: 4.2" sound scientific but don't drive real decisions. And when a new model is released, it's weeks of debate instead of data.

This talk introduces Error Analysis: a methodology to discover the concrete failure modes of your AI product and turn them into measurable evals. You'll learn how to build a failure taxonomy that enables real prioritization. Which issues are critical? Which are frequent? What should developers fix next, and how do you measure success?

The payoff: A real quality number for stakeholders. Concrete improvement tasks for developers. And when a new model drops, a ship-or-skip decision within 24 hours based on actual data.

Expect a meme-powered walkthrough, real-world examples from production, and a clear path to implement this yourself starting with just 20 traces.


Testing traditional software is "simple"... same input, same output. LLMs? Not so much. Same prompt, different result every time. So how do you actually know if your AI product is good?

Spoiler: Most teams don't. They ship on vibes and hope for the best.

This talk takes you through our real journey at Blue Yonder, where we built an LLM-powered analytics system and needed a way to actually measure its quality. You'll see how we went from "feels okay-ish" to concrete numbers that let us make real decisions - with actual examples from production along the way.

The methodology is called Error Analysis: collect traces, annotate them from the user's perspective, group similar issues into failure modes, and turn those into automated evals. Along the way, we'll share practical best practices like why binary Pass/Fail beats rating scales, and why 100% pass rate means your evals are broken.
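The pipeline above (binary pass/fail annotations, then grouping into failure modes) can be sketched in a few lines. The trace data, failure-mode names, and mapping here are illustrative stand-ins, not examples from the talk:

```python
from collections import Counter

# Hypothetical annotated traces: each gets a binary pass/fail verdict
# plus, on failure, a free-text note from open coding.
traces = [
    {"id": 1, "passed": True,  "note": None},
    {"id": 2, "passed": False, "note": "cited a table column that does not exist"},
    {"id": 3, "passed": False, "note": "hallucinated column name"},
    {"id": 4, "passed": False, "note": "answer ignored the date filter"},
]

# Axial coding collapses similar notes into named failure modes.
# This mapping is illustrative; in practice a human groups the notes.
failure_mode = {
    2: "hallucinated_schema",
    3: "hallucinated_schema",
    4: "ignored_filter",
}

# Frequencies per failure mode drive prioritization; the overall
# pass rate is the single quality number for stakeholders.
counts = Counter(failure_mode[t["id"]] for t in traces if not t["passed"])
pass_rate = sum(t["passed"] for t in traces) / len(traces)

print(f"pass rate: {pass_rate:.0%}")   # pass rate: 25%
for mode, n in counts.most_common():
    print(mode, n)
```

The binary verdict keeps aggregation trivial: a pass rate and a ranked list of failure modes fall out directly, with no debate about what a "4.2" means.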

The payoff? When a new model drops, we run our pipeline and know within hours - not weeks - whether it's better or worse for our specific use case. Real percentages. Real trade-offs. Real decisions.

Expect a meme-powered walkthrough and a clear path to implement this yourself starting with just 20 traces.

Outline:
- Introduction: the challenge of testing stochastic systems; why we needed a better approach
- Collecting and Annotating Traces: every trace is a user experiencing your product; Open Coding from the user perspective; real examples of failure modes we discovered
- Building the Failure Taxonomy: grouping observations into categories; Axial Coding; turning scattered comments into actionable failure modes
- Writing Evals That Work: LLM-as-judge setup; binary scores vs rating scales; validating against human judgment
- From Vibes to Decisions: prioritizing what to fix; measuring improvement; 24-hour model benchmarking
- Wrap-up: your action plan; start with 20 traces
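The "validating against human judgment" step in the outline can be sketched as an agreement check between a judge and human labels. The `judge` function here is a toy heuristic stand-in for a real LLM call, and the labeled traces are invented for illustration:

```python
# Minimal sketch: validate a binary LLM-as-judge against human labels.

def judge(trace: str) -> bool:
    # In practice: send the trace plus a pass/fail rubric to an LLM
    # and parse a strict PASS/FAIL answer. Toy stand-in heuristic here.
    return "error" not in trace.lower()

# Human annotations from open coding: (trace text, human pass/fail).
labeled = [
    ("Revenue by region, correct totals", True),
    ("ERROR: column 'revnue' not found", False),
    ("Forecast chart rendered as expected", True),
    ("Error: ignored requested date range", False),
]

# Agreement rate tells you whether the judge can be trusted to run
# unattended over new traces.
agreement = sum(judge(t) == label for t, label in labeled) / len(labeled)
print(f"judge/human agreement: {agreement:.0%}")
```

Only once agreement on a held-out labeled set is high enough does it make sense to let the automated judge score new model candidates on its own.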


Expected audience expertise in your talk's domain: Intermediate
Expected audience expertise in Python: None

Martin Seeler supercharges global supply chains with GenAI as Sr Staff AI Engineer at Blue Yonder. He ships AI that survives angry customers, skeptical executives, and Black Friday traffic. Speaks globally about the messy reality of production AI. Measures success in customer value delivered.