2025-12-10 – Horace Mann
Building accurate AI workflows can get complicated fast. By explicitly defining and modularizing agent tasks, my AI flows have become more precise, more consistent, and more efficient. But can we prove it? In this talk, I'll walk you through an agentic app built with Langflow and show how giving agents narrower, well-defined tasks leads directly to more accurate, consistent results. We'll put that theory to the test with evals in Pytest and LangSmith, iterating across different agent setups, analyzing the data, and tightening up the app. By the end, we'll have a clear, repeatable workflow that gives us confidence in how future agent or LLM changes will affect outcomes, before we ever hit deploy.
Building reliable AI workflows requires more than clever prompts—it demands structure, testing, and iteration. In this talk, we’ll look at how modular agent design and evaluation can make large language model (LLM)–based systems more predictable and trustworthy.
We’ll start by demonstrating an agentic application in Langflow, defining each agent’s responsibilities explicitly and observing how tighter task boundaries improve consistency and conciseness. From there, we’ll add some LangSmith decorators to an existing Pytest setup, treating LLM behaviors as testable components. You’ll see how to benchmark different agent configurations and measure improvement over time with reproducible metrics and data.
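To give a flavor of that Pytest-plus-LangSmith step, here is a minimal sketch, not the talk's actual code. The `run_agent` helper and the two configuration names are placeholders invented for the example; the only library APIs used are Pytest and LangSmith's `traceable` decorator, with tracing enabled through LangSmith's usual environment variables.

```python
import pytest
from langsmith import traceable


@traceable(name="demo_agent")  # records each call's inputs and outputs to LangSmith when tracing is enabled
def run_agent(question: str, config: str) -> str:
    # Placeholder: in the real app this would invoke a Langflow flow or LLM
    # with the given configuration. A canned answer keeps the sketch runnable offline.
    return "Paris" if "capital of France" in question else "unknown"


# Benchmark two hypothetical agent setups against the same expectation.
@pytest.mark.parametrize("config", ["single_broad_agent", "narrow_task_agents"])
def test_capital_question(config: str) -> None:
    answer = run_agent("What is the capital of France?", config)
    assert "Paris" in answer
```

The same pattern extends to parametrizing over prompts, models, or whole flows, so the results for each configuration land in LangSmith side by side and can be compared across iterations.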
By the end, you’ll have a concrete workflow for iterating on agent designs the same way you would tune and validate any other machine learning model. Attendees should be comfortable with Python and have some familiarity with LLM frameworks or API-based agents, but all examples will be self-contained and reproducible.
A Gen-AI / Agentic nerd with decades of coding experience who loves to learn and help others do the same!