PyData Boston 2025

Where Have All the Metrics Gone?
2025-12-09, Horace Mann

How exactly does one validate the factuality of answers from a Retrieval-Augmented Generation (RAG) system? Or measure the impact of a new system prompt for your customer service agent? What do you do when stakeholders keep asking for “accuracy” metrics that you simply don’t have? In this talk, we’ll learn how to define (and measure) what “good” looks like when traditional model metrics don’t apply.


In the good old supervised learning days, standard measures like accuracy, F1, and MSE were like blazes on the data science trail, showing us how to descend the gradient towards "better". But now we're in uncharted analytics territory, where our work increasingly involves unlabeled data and generative AI outputs, and metrics are either unavailable or undefined.
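For contrast, here is a minimal refresher sketch of those label-dependent metrics, computed with scikit-learn on invented toy labels (all values are illustrative, not from the talk):

```python
# With ground-truth labels, the classic trail markers are one import away.
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

y_true = [1, 0, 1, 1, 0, 1]  # ground-truth labels, the luxury we used to have
y_pred = [1, 0, 0, 1, 0, 1]  # model predictions

print(accuracy_score(y_true, y_pred))  # 0.833...
print(f1_score(y_true, y_pred))        # 0.857...

# Regression metrics make the same assumption, just with numeric targets.
print(mean_squared_error([2.5, 0.0, 2.1], [3.0, -0.1, 2.0]))  # 0.09
```

Every one of those calls presumes a `y_true` column we can trust; the rest of the talk is about what to do when no such column exists.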

The key to every successful trek is preparation. We have to move from thinking about “metrics as defaults” to “metrics as design choices.” We also need to design those metrics before we even start testing, because when we devise metrics after the results are in, we risk HARKing (Hypothesizing After the Results are Known) and losing our scientific footing.

This talk will provide a field guide for translating different kinds of modern research questions into clearly-defined metrics, including:
* Metrics of the past and why they aren't as useful now (~5 min)
* Common failure modes when attempting to evaluate generative AI outputs and other unlabeled data (~8 min)
* Techniques for identifying proxies when labels are missing (~8 min)
* Defining criteria for open-ended outputs (~8 min)
* Open source Python libraries (including new tools like outlines and dspy, as well as old favorites like hypothesis and pytest) to equip you for your next data science adventure (~8 min); a short sketch follows this list
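As a small taste of the last three bullets, here is a hedged sketch of turning evaluation criteria for open-ended outputs into executable checks with hypothesis and pytest. The `answer_question` function is a hypothetical placeholder for a real RAG or LLM call, and the three asserted criteria are invented examples of the kind of design choices the talk covers:

```python
# Sketch: encode the properties a "good" answer must satisfy, then test them
# across many generated inputs. No gold labels required.
from hypothesis import given, settings, strategies as st


def answer_question(question: str) -> str:
    """Placeholder for a real generative system; swap in your own call."""
    return f"Here is a concise answer to: {question.strip()}"


@given(question=st.text(min_size=1, max_size=200))
@settings(max_examples=50, deadline=None)  # generative calls can be slow
def test_answer_meets_criteria(question):
    answer = answer_question(question)
    # Criterion 1: the system always returns *something*.
    assert isinstance(answer, str) and answer.strip()
    # Criterion 2: answers respect a length budget (a design choice, not a default).
    assert len(answer) <= 1_000
    # Criterion 3: no leaked internal markers (another explicit design choice).
    assert "[INTERNAL]" not in answer
```

Running pytest on this file exercises the criteria across dozens of generated inputs; the specific assertions matter less than the habit of writing them down before looking at any outputs.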

Come learn how to define and adapt new metrics so that you'll be prepared for wherever your modeling journey takes you.


Prior Knowledge Expected: No previous knowledge expected

Dr. Rebecca Bilbro, co-founder and CTO of Rotational Labs, is a trailblazer in applied AI and machine learning engineering. She co-created Yellowbrick, a Python library that extends the scikit-learn and matplotlib APIs with visual diagnostics for more intuitive model steering.

At Rotational Labs, Dr. Bilbro leads initiatives that empower companies to harness their domain expertise and data, resulting in the successful deployment of large language models and data-driven products. Her efforts bridge the gap between data science and engineering, driving AI solutions that are grounded in real-world business needs, informed by research, rigorously prototyped, and built with deployment and data governance in mind.

She is the co-author of Applied Text Analysis with Python (2018, O’Reilly) and Apache Hudi: The Definitive Guide (2025, O’Reilly). Dr. Bilbro earned her Ph.D. from the University of Illinois, Urbana-Champaign, focusing her research on domain-specific languages within engineering.