PyData Boston 2025

Is Your LLM Evaluation Missing the Point?
2025-12-10, Horace Mann

Your LLM evaluation suite shows 93% accuracy. Then domain experts point out it's producing catastrophically wrong answers for real-world use cases. This talk explores the collaboration gap between AI engineers and domain experts that technical evaluation alone cannot bridge. Drawing on government, healthcare, and civic tech case studies, we'll examine why tools like PromptFoo, DeepEval, and RAGAS are necessary but insufficient, and how structured collaboration with domain stakeholders reveals critical failures invisible to standard metrics. You'll leave with practical starting points for building cross-functional evaluation that catches problems before deployment.


The Problem

The Python ecosystem offers excellent LLM evaluation tools, yet evaluation projects still fail in production. One recurring reason is the gap between what technical metrics measure and what actually matters in real-world domains.
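
To make that distinction concrete, here is a minimal sketch of the kind of metric-only check these tools automate well. It is written in plain Python rather than any particular framework, and the `model` callable and test cases are hypothetical.

    def exact_match_accuracy(model, test_cases):
        """Fraction of cases where the model's output matches the reference answer."""
        hits = 0
        for case in test_cases:
            prediction = model(case["prompt"])
            hits += int(prediction.strip().lower() == case["reference"].strip().lower())
        return hits / len(test_cases)

    # Hypothetical usage:
    # test_cases = [{"prompt": "Is a separate form required to appeal?", "reference": "yes"}, ...]
    # print(f"accuracy = {exact_match_accuracy(my_model, test_cases):.1%}")

A number like this is easy to compute and easy to trend over time, which is exactly why it tends to stand in for "the evaluation" even when it answers a narrower question than the domain needs.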

What This Talk Covers

This talk examines critical gaps between technical teams and domain experts that cause evaluation failures:

The Knowledge Gap
Domain experts identify problems invisible to technical metrics (e.g., proxy variables encoding historical bias, feedback loops creating self-fulfilling prophecies, and distributional justice issues). Through examples from healthcare, criminal justice, and child welfare, we'll explore what domain expertise reveals that accuracy scores miss.
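
As a hedged illustration of what a single headline number can hide, the sketch below slices one accuracy score by a group a domain expert would flag as critical. The data, field names, and cohort labels are hypothetical.

    from collections import defaultdict

    def accuracy_by_group(records, group_key):
        """Break one headline accuracy number into per-group accuracies."""
        totals, hits = defaultdict(int), defaultdict(int)
        for r in records:
            totals[r[group_key]] += 1
            hits[r[group_key]] += int(r["correct"])
        return {g: hits[g] / totals[g] for g in totals}

    # Hypothetical illustration: 93% overall, 0% for the group that matters most.
    records = (
        [{"cohort": "routine", "correct": True}] * 93
        + [{"cohort": "routine", "correct": False}] * 2
        + [{"cohort": "high-risk", "correct": False}] * 5
    )
    overall = sum(r["correct"] for r in records) / len(records)
    print(f"overall accuracy: {overall:.0%}")     # 93%
    print(accuracy_by_group(records, "cohort"))   # routine ~0.98, high-risk 0.0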

The Communication Gap
Technical teams evaluate model outputs; domain experts evaluate real-world impact. This mismatch leads to optimizing the wrong objectives. We'll examine why metric translation fails and how to bridge model-focused versus user-focused evaluation approaches.
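
A small, hypothetical sketch of that mismatch: an output-similarity score that looks healthy next to a user-focused check that fails because the one fact the user must act on is wrong. Both checks are illustrative stand-ins, not any tool's built-in metrics.

    def model_focused_score(output: str, reference: str) -> float:
        """Surface overlap with a reference answer (what dashboards usually track)."""
        out, ref = set(output.lower().split()), set(reference.lower().split())
        return len(out & ref) / max(len(ref), 1)

    def user_focused_check(output: str, required_facts: list[str]) -> bool:
        """Did the user get every fact they must act on? (what a caseworker checks)"""
        return all(fact.lower() in output.lower() for fact in required_facts)

    reference = "File the appeal within 60 days of the notice date."
    output = "You can file the appeal within 90 days of the notice date."

    print(round(model_focused_score(output, reference), 2))   # ~0.89: looks fine
    print(user_focused_check(output, ["within 60 days"]))     # False: wrong deadline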

The Power Gap
When domain experts validate rather than co-design evaluation criteria, their most valuable contribution (defining what matters) gets lost. We'll look at when stakeholder involvement typically happens, where it should happen in your pipeline, and why that timing matters for catching failures before they reach production.
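
One way to picture "co-design instead of validation" is to have domain experts author acceptance criteria before any outputs exist, and treat some of those criteria as release gates. Everything in this sketch (names, roles, criteria, and the `release_gate` helper) is hypothetical.

    from dataclasses import dataclass

    @dataclass
    class EvalCriterion:
        name: str
        description: str        # written by the domain expert, in their own language
        authored_by: str
        hard_gate: bool = True  # failing this blocks release; it is not just a metric dip

    criteria = [
        EvalCriterion(
            name="no_invented_deadlines",
            description="Never state a filing deadline that does not appear in the source policy.",
            authored_by="benefits caseworker",
        ),
        EvalCriterion(
            name="plain_language",
            description="Answers should be readable at roughly an 8th-grade level.",
            authored_by="benefits caseworker",
            hard_gate=False,
        ),
    ]

    def release_gate(results: dict[str, bool]) -> bool:
        """Block deployment if any expert-authored hard gate fails."""
        return all(results[c.name] for c in criteria if c.hard_gate)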

Practical Starting Points

The talk concludes with concrete next steps: specific questions to ask domain experts, guidance on where evaluation fits in your LLM pipeline, knowledge elicitation techniques, and ways to use Model Cards as boundary objects for cross-functional collaboration.
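
As one possible shape for that last point, here is a minimal sketch of a Model Card treated as a shared artifact that both engineers and domain experts edit. The field names follow the spirit of the original Model Cards proposal, but this exact schema and all example values are assumptions for illustration.

    from dataclasses import dataclass, field

    @dataclass
    class ModelCard:
        model_name: str
        intended_use: str                # filled in with domain experts
        out_of_scope_uses: list[str]     # often the most revealing field
        evaluation_data: str             # what the metrics were actually computed on
        known_limitations: list[str]     # populated during expert review sessions
        metrics: dict[str, float] = field(default_factory=dict)

    card = ModelCard(
        model_name="benefits-faq-assistant (hypothetical)",
        intended_use="Drafting answers to routine benefits questions for staff review.",
        out_of_scope_uses=["Final eligibility determinations", "Appeals guidance"],
        evaluation_data="250 historical tickets, sampled with caseworker input",
        known_limitations=["Invents deadlines when policy text is ambiguous"],
        metrics={"exact_match_accuracy": 0.93},
    )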

Target Audience & Prerequisites

This talk is designed for intermediate-level data scientists and AI engineers implementing LLM applications in production, along with anyone else interested in AI evaluation. No specific tool expertise is required. Basic understanding of common evaluation concepts (e.g., accuracy, precision/recall) is helpful but not essential.

Why This Talk

Most LLM evaluation content focuses on tools and metrics. This talk addresses a different challenge: how to bridge the gap between AI engineers and the stakeholders who see risks your metrics don't capture. The examples come from real deployments, including direct experience with government implementations, and from published research on documented evaluation failures with serious consequences.


Prior Knowledge Expected: No previous knowledge expected

Daina brings technical depth and community-building expertise to her role as Sr. Developer Relations Engineer at Anaconda. With over 12 years bridging data science, library science, and open source advocacy, she's spent her career making complex technology more accessible to researchers and practitioners. Her work has included pioneering software citation and preservation initiatives at the Harvard-Smithsonian Center for Astrophysics and developing AI evaluation frameworks for federal agencies. This experience has given her insight into both the technical challenges developers face and the human side of adopting new tools. At Anaconda, she works to strengthen connections between Anaconda's engineering teams and the broader developer community, creating resources and fostering relationships that help people solve important problems with open source tools.