EuroSciPy 2026

Making LLM Evaluation Reproducible in Python
2026-07-21, Room 1.19 (Ground Floor, Shannon)

Large Language Models are increasingly integrated into scientific and production workflows, yet evaluation practices often remain informal and notebook-driven. This talk explores how to build reproducible, measurable, and regression-safe LLM evaluation pipelines using Python. We will examine dataset design, metric selection, deterministic evaluation harnesses, and CI integration strategies that transform LLM experimentation into disciplined, testable engineering workflows.


LLM-powered systems are rapidly moving from experimentation to operational use across research and applied domains. However, evaluation practices frequently remain ad hoc: prompt tweaks in notebooks, manual spot-checking, and loosely defined metrics.

In scientific computing, reproducibility and rigor are foundational. This talk explores how to bring those same principles to LLM system evaluation using Python-based tooling.

We will examine practical approaches for designing reproducible evaluation pipelines, including:
- Constructing versioned evaluation datasets
- Defining measurable task-specific metrics (accuracy, faithfulness, consistency)
- Designing deterministic evaluation harnesses around probabilistic models
- Structuring experiments for comparability across model versions
- Integrating evaluation workflows into CI pipelines
- Tracking regression across prompts, embeddings, and retrieval strategies
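As a rough illustration of the first few items, the sketch below ties every evaluation result to a content hash of the dataset and to an explicit seed, so that two runs can be compared like for like. It is a minimal sketch, not the harness presented in the talk: `Example`, `dataset_fingerprint`, `run_eval`, and `stub_model` are hypothetical names, and `stub_model` merely stands in for a real, seeded model call.

```python
import hashlib
import json
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    prompt: str
    expected: str


def dataset_fingerprint(examples):
    """Short content hash identifying the exact dataset version."""
    payload = json.dumps([[e.prompt, e.expected] for e in examples], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def stub_model(prompt, seed):
    """Hypothetical stand-in for a model call (assumption, not a real API).

    The output depends only on (seed, prompt), which mimics a sampled
    model whose randomness has been pinned.
    """
    rng = random.Random(f"{seed}:{prompt}")
    answers = {"2+2": "4", "capital of France": "Paris"}
    # Occasionally return a wrong answer to imitate stochastic output.
    return answers.get(prompt, "unknown") if rng.random() > 0.1 else "unknown"


def run_eval(examples, model, seed=0):
    """Run every example once and return a self-describing report."""
    outputs = [model(e.prompt, seed) for e in examples]
    correct = sum(o.strip() == e.expected for o, e in zip(outputs, examples))
    return {
        "dataset": dataset_fingerprint(examples),
        "seed": seed,
        "n": len(examples),
        "accuracy": correct / len(examples),
    }


EXAMPLES = [Example("2+2", "4"), Example("capital of France", "Paris")]
report = run_eval(EXAMPLES, stub_model, seed=7)
print(json.dumps(report, indent=2))
```

Because the report carries the dataset fingerprint and the seed, a change in accuracy can be attributed to the model or prompt rather than to a silently edited dataset or a different sample.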

Using concrete Python examples, we will demonstrate how lightweight testing patterns (inspired by pytest and CI best practices) can be adapted to LLM workflows. We will also discuss limitations: metric instability, evaluation drift, and trade-offs between strict reproducibility and model stochasticity.
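One way such a pytest-style pattern might look is sketched below: the metric is averaged over several fixed seeds and compared against a stored baseline with a tolerance, which softens the tension between strict reproducibility and model stochasticity. Everything here is an assumption made for illustration: `fake_model`, `CASES`, `BASELINE_ACCURACY`, `TOLERANCE`, and `SEEDS` are hypothetical, and the stand-in model is deliberately wrong for one seed so that the averaging has something to absorb.

```python
import statistics

# Hypothetical values: a baseline recorded from a previously accepted run
# and the largest drop tolerated before the check fails.
BASELINE_ACCURACY = 0.90
TOLERANCE = 0.05
SEEDS = (0, 1, 2, 3, 4)

CASES = [("2+2", "4"), ("3+3", "6"), ("10-4", "6"), ("5*2", "10")]
ANSWERS = dict(CASES)


def fake_model(prompt, seed):
    """Hypothetical stand-in for a model call; one seed yields a wrong answer."""
    if seed == 3 and prompt == "10-4":
        return "7"
    return ANSWERS[prompt]


def accuracy(seed):
    """Fraction of cases answered exactly right for a given seed."""
    hits = sum(fake_model(prompt, seed) == expected for prompt, expected in CASES)
    return hits / len(CASES)


def test_same_seed_is_repeatable():
    # With the seed pinned, two runs must agree exactly.
    assert accuracy(3) == accuracy(3)


def test_no_accuracy_regression():
    # Average over seeds so that a single unlucky sample does not fail the build.
    mean = statistics.mean(accuracy(seed) for seed in SEEDS)
    assert mean >= BASELINE_ACCURACY - TOLERANCE, f"mean accuracy {mean:.2f} regressed"
```

Saved in a file such as `test_eval.py`, these functions are collected by pytest through its default naming conventions, and a failing assertion produces a non-zero exit code that a CI job can treat as a failed build.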

The goal of this session is not to promote a specific framework, but to provide practical, stack-agnostic engineering patterns that make LLM evaluation measurable, repeatable, and auditable.

Attendees will leave with:
- A structured approach to LLM evaluation design
- A template for building Python-based evaluation harnesses
- Strategies for maintaining comparability across experiments
- Practical insight into making probabilistic systems testable

This session is aimed at practitioners who want to move beyond exploratory notebooks and adopt disciplined evaluation workflows aligned with scientific computing standards.


Expected audience expertise: Domain: some
Expected audience expertise: Python: some
Your relationship with the presented work/project: Original author or co-author

12-time award-winning AI lead and 'Sculpting Data For ML' author Jigyasa Grover drives rider personalization innovation at Uber after transforming Twitter/X, Facebook/Meta, Faire, and Bordo AI with large-scale ML systems. Handpicked by Google for their I/O 2024 keynote, she serves on Google's Developer Advisory Board while advising social search engine Diem and other Silicon Valley startups.

As a LinkedIn Learning instructor, Jigyasa educates thousands of professionals worldwide on cutting-edge AI-powered applications and agentic AI systems, solidifying her status as a thought leader in artificial intelligence education. A Google Developer Expert, Women Techmakers Ambassador, and World Economic Forum Global Shaper, Jigyasa has been featured in Forbes, Business Insider, VentureBeat, and International Business Times, and has spoken on panels with Harvard University, Preston-Werner Ventures, Norwegian Business School, Humanitarian Frontier in AI, Women in Data, and more.

The UC San Diego alumna has secured funding from the Canadian and Norwegian governments, the Linux Foundation, and multiple tech giants, enabling work that transcends geographical boundaries. With 200+ media features and contributions to open source recognized by Apache and Python Software Foundations, she mentors next-generation talent while shaping AI's future through advisory roles at Bezoku AI, Las Positas College, and various AI forums.