Evaluating multi-turn conversations: A practical guide to AI Agent evals PyData London 2026

Evaluating multi-turn conversations: A practical guide to AI Agent evals
2026-06-06 11:50–12:35, Grand Hall 2

As AI agents become more popular, one question becomes increasingly important: how do you actually know if your agent is performing well? Multi-turn conversations are hard to evaluate because there is rarely one right answer, and at any given turn multiple responses can be correct. In this talk, we'll walk through a structured approach to evaluating complex conversations. We'll cover what makes a good conversation, techniques for evaluating multi-turn conversations where multiple outcomes are simultaneously valid, and how to scale evaluation pipelines. Finally, we'll discuss practical frameworks for continuous improvement and building confidence in your agent's real-world behaviour.

As AI agents move from demos to production, evaluating their performance becomes one of the most important challenges for teams shipping them. Unlike single-turn LLM calls, conversations are messy. You can't evaluate a response in isolation: each turn depends on prior context, and a perfectly correct answer in one conversation might be wrong in another.

In this talk we'll discuss a systematic approach to evaluating complex multi-turn conversations.

We'll talk about:

Defining what makes a "good" conversation
The unique challenges of multi-turn evaluation
Metrics for assessing conversation quality
Constructing evaluation datasets for conversational AI agents
Automated pipelines for continuous agent evaluation in production

We'll show practical implementations using Python, with real-world examples from production agent systems across different domains.
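
To make the multi-turn problem concrete, here is a minimal sketch of context-dependent scoring. It is not taken from the talk: the `Turn` class, the `action` labels, and the toy refund rule in `valid_actions` are all hypothetical, and in a real pipeline the rule would likely be replaced by human labels or an LLM judge. It illustrates two points from the abstract: several responses can be valid at the same turn, and the same response can be right in one conversation and wrong in another.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    role: str                     # "user" or "agent"
    text: str
    action: Optional[str] = None  # hypothetical label attached to agent turns


def valid_actions(history: list[Turn]) -> set[str]:
    """Toy rule (an assumption): acceptable agent actions depend on prior turns."""
    # Treat any digit in an earlier user message as "order number provided".
    order_id_given = any(
        t.role == "user" and any(ch.isdigit() for ch in t.text) for t in history
    )
    if order_id_given:
        return {"issue_refund", "escalate"}
    return {"ask_order_id", "escalate"}


def score_conversation(turns: list[Turn]) -> float:
    """Fraction of agent turns whose action is acceptable given the turns before it."""
    results = [
        turn.action in valid_actions(turns[:i])
        for i, turn in enumerate(turns)
        if turn.role == "agent"
    ]
    return sum(results) / len(results) if results else 0.0


# "issue_refund" is acceptable here because the order number was given first.
conv_a = [
    Turn("user", "I want a refund."),
    Turn("agent", "What's your order number?", "ask_order_id"),
    Turn("user", "Order 4821."),
    Turn("agent", "Your refund has been issued.", "issue_refund"),
]

# The same action is wrong here: no order number has been provided yet.
conv_b = [
    Turn("user", "I want a refund."),
    Turn("agent", "Your refund has been issued.", "issue_refund"),
]

# A different first response that is also acceptable at the same turn.
conv_c = [
    Turn("user", "I want a refund."),
    Turn("agent", "Let me connect you to a colleague.", "escalate"),
]

scores = [score_conversation(c) for c in (conv_a, conv_b, conv_c)]
```

Scoring each agent turn against a set of acceptable actions, rather than a single reference answer, is one simple way to handle turns where more than one outcome is valid.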

Attendees will leave with:

A structured framework for defining and measuring conversation quality in their domain
Practical techniques for evaluating multi-turn interactions at scale

The session will provide actionable insights for AI engineers, data scientists, and product managers looking to evaluate AI agents rigorously and build stakeholder trust.

Lena Shakurova

Lena Shakurova is the founder of ParsLabs (https://parslabs.org), a Conversational AI agency, and Chatbotly (https://chatbotly.co), a no-code platform for building AI assistants trained on custom data.

At ParsLabs, she leads a team blending AI, user research and conversation science to design and develop high-quality AI conversations that sound human. She has a background in NLP and artificial intelligence, 8+ years of experience, and 110+ successful projects building production-ready chatbots and voice assistants.

Lena focuses on ethical, user-first AI, leveraging her expertise in Linguistics & AI to create responsible, high-quality AI solutions. She shares insights on AI innovation and human-centered design through her blog (https://shakurova.io/blog) and LinkedIn (https://www.linkedin.com/in/lena-shakurova/).
