PyCon DE & PyData 2026

Martin Seeler

Martin Seeler supercharges global supply chains with GenAI as Sr Staff AI Engineer at Blue Yonder. He ships AI that survives angry customers, skeptical executives, and Black Friday traffic. He speaks globally about the messy reality of production AI and measures success in customer value delivered.


Session

04-16
15:05
30min
AI Evals Done Right: From Vibes to Confident Decisions
Martin Seeler

Testing traditional software is "simple"... same input, same output. LLMs? Not so much. Same prompt, different result every time. So how do you actually know if your AI product is good?

Most teams struggle with this. Generic metrics like "Helpfulness: 4.2" sound scientific but don't drive real decisions. And when a new model is released, teams spend weeks debating instead of looking at data.

This talk introduces Error Analysis: a methodology to discover the concrete failure modes of your AI product and turn them into measurable evals. You'll learn how to build a failure taxonomy that enables real prioritization. Which issues are critical? Which are frequent? What should developers fix next, and how do you measure success?
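To make the idea of a failure taxonomy concrete, here is a minimal sketch in Python. It is not taken from the talk: the trace data, the failure-mode names, the severity weights, and the "frequency times severity" priority score are all assumptions chosen for illustration. It assumes each trace has already been reviewed by hand and tagged with the failure modes observed in it.

```python
from collections import Counter

# Hypothetical hand-labeled traces; ids and failure-mode names are illustrative.
traces = [
    {"id": 1, "failures": []},
    {"id": 2, "failures": ["wrong_date_format"]},
    {"id": 3, "failures": ["hallucinated_order_id", "wrong_date_format"]},
    {"id": 4, "failures": []},
    {"id": 5, "failures": ["hallucinated_order_id"]},
]

# Assumed severity weights (1 = cosmetic, 3 = critical), set by the team.
severity = {"wrong_date_format": 1, "hallucinated_order_id": 3}


def failure_taxonomy(traces, severity):
    """Count each failure mode once per trace and rank by frequency * severity."""
    counts = Counter(mode for t in traces for mode in set(t["failures"]))
    rows = []
    for mode, n in counts.items():
        frequency = n / len(traces)
        rows.append({
            "mode": mode,
            "count": n,
            "frequency": frequency,
            # Unknown modes default to the lowest severity weight.
            "priority": frequency * severity.get(mode, 1),
        })
    return sorted(rows, key=lambda r: r["priority"], reverse=True)


def pass_rate(traces):
    """Share of traces with no observed failure mode."""
    return sum(1 for t in traces if not t["failures"]) / len(traces)


if __name__ == "__main__":
    for row in failure_taxonomy(traces, severity):
        print(row)
    print("pass rate:", pass_rate(traces))
```

In this toy data both failure modes are equally frequent, so the assumed severity weight decides the ranking. The pass rate is one possible candidate for a single quality number, and rerunning the same tally on traces from a new model would give a like-for-like comparison.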

The payoff: A real quality number for stakeholders. Concrete improvement tasks for developers. And when a new model drops, a ship-or-skip decision within 24 hours based on actual data.

Expect a meme-powered walkthrough, real-world examples from production, and a clear path to implementing this yourself, starting with just 20 traces.

PyData: Generative AI & Synthetic Data
Platinum [2nd Floor]