Martin Seeler
Martin Seeler supercharges global supply chains with GenAI as Sr Staff AI Engineer at Blue Yonder. He ships AI that survives angry customers, skeptical executives, and Black Friday traffic. Speaks globally about the messy reality of production AI. Measures success in customer value delivered.
Session
Testing traditional software is "simple": same input, same output. LLMs? Not so much. Same prompt, different result every time. So how do you actually know if your AI product is good?
Most teams struggle with this. Generic metrics like "Helpfulness: 4.2" sound scientific but don't drive real decisions. And when a new model is released, what follows is weeks of debate instead of data.
This talk introduces Error Analysis: a methodology to discover the concrete failure modes of your AI product and turn them into measurable evals. You'll learn how to build a failure taxonomy that enables real prioritization. Which issues are critical? Which are frequent? What should developers fix next, and how do you measure success?
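The prioritization step above can be sketched in a few lines: tally labeled failure traces into a taxonomy and rank the failure modes by frequency times severity. Everything here is illustrative, not from the talk itself; the failure-mode labels, severity weights, and the `prioritize` helper are hypothetical stand-ins for whatever taxonomy your own error analysis produces.

```python
from collections import Counter

# Hypothetical labeled traces: each trace carries a failure-mode label
# from the taxonomy, or None if the output was fine.
labeled_traces = [
    "hallucinated_sku", "ignored_constraint", None, "hallucinated_sku",
    "wrong_units", None, "hallucinated_sku", "ignored_constraint",
]

# Illustrative severity weights: 1 = cosmetic, 3 = blocks the user.
severity = {
    "hallucinated_sku": 3,
    "ignored_constraint": 2,
    "wrong_units": 1,
}

def prioritize(traces, severity):
    """Rank failure modes by frequency x severity (most urgent first)."""
    counts = Counter(t for t in traces if t is not None)
    return sorted(
        counts.items(),
        key=lambda kv: kv[1] * severity[kv[0]],
        reverse=True,
    )

for mode, count in prioritize(labeled_traces, severity):
    print(f"{mode}: {count}/{len(labeled_traces)} traces, "
          f"priority {count * severity[mode]}")
```

The ranked list doubles as the developers' backlog: each failure mode at the top is a concrete fix, and re-running the same tally after a change gives a before/after number.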
The payoff: A real quality number for stakeholders. Concrete improvement tasks for developers. And when a new model drops, a ship-or-skip decision within 24 hours based on actual data.
Expect a meme-powered walkthrough, real-world examples from production, and a clear path to implement this yourself starting with just 20 traces.