PyData Boston 2025

Evaluating AI Agents in production with Python
2025-12-10, Thomas Paul

This talk covers methods of evaluating AI Agents, with an example of how the speakers built a Python-based evaluation framework for a user-facing AI Agent system that has been in production for over a year. We share the tools and Python frameworks we used (along with alternatives), and discuss methods such as LLM-as-Judge, rules-based evaluations, and ML metrics, as well as the tradeoffs involved in selecting them.


Many developers and companies are releasing applications powered by Agentic AI. During development, and especially after deployment, it is important to answer these questions:

How do we estimate the quality of responses of these AI applications?
If we make a change, how do we guarantee that the change is truly an improvement and won’t cause degradation in the user experience?
How can we easily test these results in a repeatable manner?

In addition to traditional software testing, evaluating generative AI applications involves statistical methods, nuanced qualitative review, and a deep understanding of user goals.

This talk outlines a Python-based evaluation framework that we built to evaluate AI Agent-powered features that have been in production for over a year and are used by Fortune 500 companies.

The talk covers the following:
Test dataset creation and curation (5 minutes)
Tracing and evaluation for your live application (5 minutes)
Offline evaluations for development (10 minutes)
Evaluation methods: LLM-as-Judge, rules-based evaluation with Python (see the sketch after this list) (10 minutes)
Scoring mechanisms: how best to roll up and communicate scores for product improvement (5 minutes)
You have evaluations, now what? We share how we use evaluation results to improve our software. (5 minutes)
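
To make the evaluation methods above concrete, below is a minimal sketch in plain Python of a rules-based check and an LLM-as-Judge scorer. The function names, the judge prompt, and the use of the OpenAI client are illustrative assumptions, not the framework presented in the talk.

    # Minimal sketch: a rules-based check and an LLM-as-Judge scorer.
    # Names, prompt wording, and the OpenAI client usage are illustrative
    # assumptions, not the framework described in this talk.
    import json
    from openai import OpenAI  # assumes the OpenAI Python SDK is installed

    client = OpenAI()  # reads OPENAI_API_KEY from the environment


    def rules_based_score(response: str, required_phrases: list[str]) -> float:
        """Deterministic check: fraction of required phrases the agent mentioned."""
        if not required_phrases:
            return 1.0
        hits = sum(phrase.lower() in response.lower() for phrase in required_phrases)
        return hits / len(required_phrases)


    def llm_judge_score(question: str, response: str, model: str = "gpt-4o-mini") -> int:
        """LLM-as-Judge: ask a model to grade the response on a 1-5 scale."""
        prompt = (
            "Rate how well the response answers the question on a scale of 1 to 5.\n"
            f"Question: {question}\n"
            f"Response: {response}\n"
            'Reply with JSON only, like {"score": 3}.'
        )
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        # Assumes the judge follows the JSON instruction; production code would
        # validate and retry on malformed output.
        return json.loads(completion.choices[0].message.content)["score"]

In practice, per-example scores like these are rolled up (for example, averaged) across a curated test set, so that regressions between agent versions show up as score drops.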

Target audience: Data scientists, ML engineers, AI engineers, and software engineers who build GenAI applications and are interested in evaluating them quantitatively as well as qualitatively. The talk shares the Python-based tools we used, and covers the alternatives and tradeoffs we considered and experienced.


Prior Knowledge Expected: Previous knowledge expected

Susan Shu Chang is a Principal Data Scientist at Elastic (Elasticsearch). She has spoken at six PyCons around the world and is the author of Machine Learning Interviews (O'Reilly).