Making LLM Evaluation Reproducible in Python
Jigyasa Grover, Rishabh Misra
Large Language Models (LLMs) are increasingly integrated into scientific and production workflows, yet evaluation practices often remain informal and notebook-driven. This talk explores how to build reproducible, measurable, and regression-safe LLM evaluation pipelines in Python. We will examine dataset design, metric selection, deterministic evaluation harnesses, and CI integration strategies that turn LLM experimentation into disciplined, testable engineering workflows.
Large Language Models (LLMs), Neural Networks and AI Development
Room 1.19 (Ground Floor, Shannon)
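
To make the abstract's idea of a deterministic, regression-safe evaluation harness concrete, here is a minimal sketch in Python. It is not taken from the talk. The dataset, the `stub_model` stand-in for an LLM call, and the names `evaluate`, `dataset_fingerprint`, and `check_regression` are all hypothetical, chosen only for illustration. The sketch assumes that determinism comes from a locally seeded random generator and that a regression is defined as a score falling below a stored baseline.

```python
import hashlib
import json
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    prompt: str
    expected: str


# Hypothetical, hand-written dataset; a real one would be versioned on disk.
DATASET = [
    Example("2 + 2 =", "4"),
    Example("Capital of France?", "Paris"),
    Example("Opposite of hot?", "cold"),
]


def dataset_fingerprint(dataset):
    """Hash the dataset contents so a silent change to the data is detectable."""
    payload = json.dumps([[ex.prompt, ex.expected] for ex in dataset])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def stub_model(prompt, rng):
    """Stand-in for an LLM call (hypothetical, not a real API).

    Answers known prompts from a lookup table and samples from the
    seeded RNG otherwise, to mimic non-deterministic generation.
    """
    known = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    if prompt in known:
        return known[prompt]
    return rng.choice(["cold", "warm"])


def exact_match(prediction, expected):
    """Case- and whitespace-insensitive exact match metric."""
    return prediction.strip().lower() == expected.strip().lower()


def evaluate(model, dataset, seed=0):
    """Run every example with a locally seeded RNG and return the accuracy."""
    rng = random.Random(seed)  # local RNG: independent of global random state
    hits = sum(exact_match(model(ex.prompt, rng), ex.expected) for ex in dataset)
    return hits / len(dataset)


def check_regression(score, baseline, tolerance=0.0):
    """Return True when the score has not dropped below the stored baseline."""
    return score >= baseline - tolerance
```

In a CI job, a test could call `evaluate(stub_model, DATASET, seed=0)` and assert `check_regression` against a baseline committed to the repository, alongside the value of `dataset_fingerprint(DATASET)`, so that a change to the model, the metric, or the data shows up as a failing check. With a real hosted model, seeding a local generator would not be enough on its own; that part of the sketch only illustrates the principle.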