2026-07-21, Room 2.41 (First Floor, Turing)
Reproducibility remains one of the most persistent challenges in scientific computing. Despite excellent tools like conda, pixi, and Jupyter, studies continue to show that a significant fraction of published computational results cannot be reproduced, often because of undocumented dependencies, hidden notebook state, or fragile glue code between pipeline stages.
Meanwhile, AI agents (autonomous systems that can reason, plan, and execute multi-step tasks) have matured rapidly in industry settings. Frameworks like smolagents, PydanticAI, and DSPy now make it feasible to build agents that inspect environments, trace data lineage, and verify computational workflows with minimal human intervention.
This talk bridges these two worlds. Drawing on practical experience building production AI agent systems, I will present concrete design patterns for agents that serve as "reproducibility assistants" in scientific Python workflows. The talk covers three actionable areas:
(1) automated environment auditing agents that detect undeclared dependencies and version conflicts
(2) notebook-to-pipeline conversion agents that analyze Jupyter notebooks for hidden state and generate deterministic scripts
(3) result verification agents that re-execute computational steps and flag numerical divergence.
A live demo will show an agent auditing a real scientific Python project end to end.
Motivation
The scientific Python ecosystem has made enormous progress on reproducibility infrastructure. Tools like conda-lock, pixi, uv, and containerization solutions address environment pinning. Initiatives like SPEC 7 (seeding pseudo-random number generators) tackle numerical determinism. Community efforts like The Turing Way and pyOpenSci provide guidelines and review processes.
Yet a gap persists: these tools exist, but adoption is inconsistent, and the "last mile" of reproducibility (verifying that a workflow actually reproduces on a clean machine) remains largely manual. The CODECHECK initiative, which pairs papers with independent reproducibility reviewers, has verified only a few hundred papers after years of operation, limited by the human effort required per check.
AI agents offer a compelling way to scale this verification and assist researchers in making their work reproducible from the start.
What This Talk Covers
1. The Reproducibility Gap in Practice (3 minutes)
A brief, data-driven overview of where scientific Python workflows break down:
- Environment specification failures (the "works on my machine" problem)
- Jupyter notebook hidden state (out-of-order execution, unreferenced variables)
- Glue code fragility between pipeline stages (pandas to scikit-learn to matplotlib)
- The CODECHECK scaling bottleneck
2. AI Agent Design Patterns for Reproducibility (5 minutes)
Three concrete patterns, each with working code:
Pattern A: Environment Forensics Agent
An agent that inspects a project directory, cross-references imports against declared dependencies (pyproject.toml, environment.yml, requirements.txt), detects version conflicts, and generates a minimal reproducible environment specification. Built using smolagents with tool-calling capabilities.
Pattern B: Notebook Linter Agent
An agent that statically and dynamically analyzes Jupyter notebooks to detect hidden state dependencies, out-of-order cell execution paths, and undeclared side effects. It produces a dependency graph of cells and can generate an equivalent standalone Python script. Leverages nbformat and AST analysis as agent tools.
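The static half of Pattern B can be approximated as follows. This is a rough sketch under stated assumptions: cells are passed in as plain source strings in document order (extracting them from a notebook file, for example with nbformat, is left out), scoping is ignored, and cells containing IPython magics would need preprocessing before ast.parse accepts them. The function names are hypothetical.

```python
import ast
import builtins


def names_defined_and_used(source: str) -> tuple[set[str], set[str]]:
    """Collect names a cell binds and names it reads (scoping is ignored)."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            # Treat function parameters as defined so they are not reported
            defined.add(node.arg)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
    return defined, used


def cell_dependencies(cells: list[str]):
    """Return (edges, problems) for code cells given in document order.

    edges:    (provider_cell, consumer_cell, name) for names defined earlier
    problems: (cell, name, reason) for names that top-to-bottom execution
              would not have available, a sign of hidden state
    """
    info = [names_defined_and_used(src) for src in cells]
    builtin_names = set(dir(builtins))
    edges, problems = [], []
    for i, (defined, used) in enumerate(info):
        for name in sorted(used - defined - builtin_names):
            providers = [j for j, (d, _) in enumerate(info) if j != i and name in d]
            earlier = [j for j in providers if j < i]
            if earlier:
                edges.append((earlier[-1], i, name))
            elif providers:
                problems.append((i, name, f"defined only in later cell {providers[0]}"))
            else:
                problems.append((i, name, "not defined in any cell"))
    return edges, problems


example_cells = [
    "import math\nx = 2",
    "y = scale * x",       # relies on a name defined further down
    "scale = math.pi",
    "print(z)",            # relies on a name no cell defines
]
edges, problems = cell_dependencies(example_cells)
print(edges)
print(problems)
```

Because a cell that both reads and rebinds the same name (such as x = x + 1) is not flagged here, the dynamic analysis mentioned above is still needed to catch what this static pass misses.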
Pattern C: Result Verification Agent
An agent that re-executes computational steps in an isolated environment and compares outputs against stored results, flagging numerical divergence beyond configurable tolerance (addressing floating-point and RNG reproducibility per SPEC 7 guidelines).
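The comparison step of Pattern C might look like the sketch below. It is an illustration only: simulate is a hypothetical stand-in for a real computational step, results are assumed to be flat dictionaries of floats, and the standard library's random.Random is used in place of a NumPy generator to keep the example dependency-free. Isolation of the execution environment is out of scope here.

```python
import math
import random


def simulate(seed: int) -> dict[str, float]:
    """Hypothetical computational step: summary statistics of seeded samples."""
    rng = random.Random(seed)  # explicit, local generator instead of global state
    samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    return {"mean": sum(samples) / len(samples), "max": max(samples)}


def compare_results(stored, recomputed, rel_tol=1e-9, abs_tol=0.0):
    """Return (key, stored_value, recomputed_value) for every divergent entry.

    A key missing on either side counts as a divergence. NaN never compares
    as close, so NaN results are always reported.
    """
    divergences = []
    for key in sorted(stored.keys() | recomputed.keys()):
        if key not in stored or key not in recomputed:
            divergences.append((key, stored.get(key), recomputed.get(key)))
            continue
        a, b = stored[key], recomputed[key]
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            divergences.append((key, a, b))
    return divergences


stored = simulate(seed=42)

# Re-running with the same seed should reproduce the stored values
print(compare_results(stored, simulate(seed=42)))

# A perturbed result is flagged unless the tolerance is loosened to cover it
perturbed = dict(stored, mean=stored["mean"] + 1e-3)
print([key for key, _, _ in compare_results(stored, perturbed)])
print(compare_results(stored, perturbed, abs_tol=1e-2))
```

Making the tolerance a parameter matters because the acceptable divergence depends on the computation: bit-for-bit equality is realistic for some seeded steps, while results that cross platforms or library versions usually need a looser bound.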
3. Live Demo (4 minutes)
A live demonstration where an AI agent audits a real open-source scientific Python project on stage. The agent will:
- Scan the project and cross-reference declared dependencies against actual imports, identifying missing pins and undeclared packages
- Analyze a Jupyter notebook for hidden state issues, producing a cell dependency graph
- Generate a corrected environment specification and a reproducible standalone script
The demo will show both what the agent gets right and where it fails, giving the audience an unvarnished look at the current state of the technology. This is not a polished product demo; it is a transparent exploration of what works and what does not.
4. Risks and Community Considerations (2 minutes)
Honest discussion of where agents fall short:
- Hallucination risks in dependency resolution
- The danger of "reproducibility theater": agents that report success without truly verifying
- Community governance questions: the ongoing SciPy AGENTS.md discussion and what it means for AI-assisted contributions
- Computational cost and environmental impact of running verification agents
5. Call to Action (1 minute)
- How the scientific Python community can adopt these patterns today
- Open questions that need community input
- Invitation to collaborate on open-source reproducibility agent tooling
Why This Talk Fits the Community, Education, and Outreach Track
This talk is not about AI agents as a product; it is about how the scientific Python community can leverage agent patterns to solve a shared challenge. The focus is on:
- Community: Addressing reproducibility benefits every member of the ecosystem, from students to senior researchers
- Education: Teaching practical agent design patterns that attendees can immediately apply
- Outreach: Bridging the gap between industry AI engineering practices and open science needs
Target Audience
Researchers, research software engineers, and data scientists who work with the scientific Python stack and care about reproducibility. No prior experience with AI agents is required; the talk introduces agent concepts through the lens of familiar reproducibility problems.
Nitish Agarwal is a Senior Engineering Manager at GoDaddy with 14+ years of experience in cloud architecture and AI. He has previously led engineering teams at Skyscanner, Expedia, and Balena, delivering high-availability systems for 80M+ users. Nitish specializes in scaling teams, optimizing distributed systems, and currently leads GoDaddy's AI transformation strategies for customer care.