Rishabh Misra EuroSciPy 2026

Rishabh Misra

Session

Making LLM Evaluation Reproducible in Python

Large Language Models are increasingly integrated into scientific and production workflows, yet evaluation practices often remain informal and notebook-driven. This talk explores how to build reproducible, measurable, and regression-safe LLM evaluation pipelines using Python. We will examine dataset design, metric selection, deterministic evaluation harnesses, and CI integration strategies that transform LLM experimentation into disciplined, testable engineering workflows.
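As a rough illustration of what a deterministic, regression-safe evaluation harness could look like, the sketch below scores a model on a fixed dataset and compares the result to a stored baseline. It is not taken from the talk: the dataset, `stub_model`, `dataset_fingerprint`, `exact_match_accuracy`, and `check_regression` are all hypothetical names, and the stub stands in for a real LLM call that is assumed to run with pinned decoding settings.

```python
import hashlib
import json

# Hypothetical fixed evaluation set; in practice this might be a versioned file.
DATASET = [
    {"prompt": "2 + 2 =", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
    {"prompt": "Opposite of hot?", "expected": "cold"},
]


def dataset_fingerprint(dataset):
    """Hash the dataset so a silent change to the eval set is detectable."""
    payload = json.dumps(dataset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def stub_model(prompt):
    """Stand-in for a real LLM call (assumed deterministic for this sketch)."""
    answers = {
        "2 + 2 =": "4",
        "Capital of France?": "Paris",
        "Opposite of hot?": "warm",  # deliberately wrong answer
    }
    return answers[prompt]


def exact_match_accuracy(model, dataset):
    """Fraction of rows where the stripped model output equals the expected string."""
    hits = sum(model(row["prompt"]).strip() == row["expected"] for row in dataset)
    return hits / len(dataset)


def check_regression(score, baseline, tolerance=0.0):
    """Return True if the score has not dropped below baseline minus tolerance."""
    return score >= baseline - tolerance


score = exact_match_accuracy(stub_model, DATASET)
fingerprint = dataset_fingerprint(DATASET)
```

In a CI job, a test could assert `check_regression(score, baseline)` against a committed baseline value and compare `fingerprint` to a committed hash, so that either a quality drop or an unannounced dataset change fails the build.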

Large Language Models (LLMs), Neural Networks and AI Development

Room 1.19 (Ground Floor, Shannon)
