2026-06-05, Doddington Forum
AI agents are moving into production in 2026, but when something goes wrong (a tool call fails silently, an LLM takes 13 seconds to respond, token costs spike overnight), teams struggle to diagnose issues across multi-step agentic workflows. In this hands-on tutorial you will solve a real problem on the island in Animal Crossing by building a Model Context Protocol (MCP) server in Python using FastMCP, instrumenting it with OpenTelemetry following the emerging GenAI and MCP semantic conventions, and visualising end-to-end traces in a local Jaeger instance. Did I mention that events on the island occur in real time and are collected and processed using Apache Kafka?
You will learn how distributed tracing captures the hierarchical relationship between agent conversations, tool executions and MCP protocol messages, and how to use that visibility for debugging, cost analysis and performance optimisation (including picking the right model and checking if you’re drowning in serialisation overhead). You will leave with a fully instrumented MCP server, a Docker Compose real-time observability stack and the knowledge to bring production-grade observability to your own agentic AI systems.
Why this matters
OpenTelemetry is rapidly becoming the standard telemetry backbone for AI agents, just as it is already for microservices. It is one of the most active CNCF projects after Kubernetes, with native support from 30+ observability vendors. Its GenAI Special Interest Group declared 2025 the "year of AI agents" and has since published purpose-built semantic conventions for LLM calls, agent orchestration, and MCP tool calls. The industry has followed: Amazon launched Bedrock AgentCore Observability built entirely on OTel and GenAI semantic conventions; Grafana Labs demonstrated production tracing of the OpenAI Agents SDK and AWS Bedrock AgentCore.
However, most teams building agents today have none of this. The reason is a “developer experience gap”: many agent builders come from data science and ML research backgrounds, not distributed systems, and have never configured a tracing pipeline. Traditional monitoring tools don't capture the signals that matter for agents: token usage, cost per invocation, tool selection and multi-agent handoffs. Because agentic architecture is interaction-centric (98% of wall-clock time is spent in LLM API calls and tool executions, not your code), distributed tracing, not traditional metrics, is the primary observability signal. Without it, failures are invisible: one fintech company's agent ran in a loop for 11 hours, accumulating $47,000 in costs before anyone noticed. This tutorial bridges that gap for the PyData audience.
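As a rough illustration of "cost per invocation", the sketch below turns token counts into a dollar figure and stops a looping agent once a budget would be exceeded. The prices, the `Invocation` type and `run_with_budget` are assumptions made up for this example, not real vendor rates or a library API.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's real rates.
PRICE_PER_MILLION = {"input": 3.00, "output": 15.00}


@dataclass
class Invocation:
    input_tokens: int
    output_tokens: int


def invocation_cost(inv):
    """Cost of one LLM call in USD, computed from its token usage."""
    return (
        inv.input_tokens * PRICE_PER_MILLION["input"]
        + inv.output_tokens * PRICE_PER_MILLION["output"]
    ) / 1_000_000


def run_with_budget(invocations, budget_usd):
    """Accumulate cost per call and stop before the budget would be exceeded."""
    spent = 0.0
    completed = 0
    for inv in invocations:
        cost = invocation_cost(inv)
        if spent + cost > budget_usd:
            break
        spent += cost
        completed += 1
    return completed, spent


# An agent stuck repeating the same call 100 times, capped at a 10-cent budget.
looping_agent = [Invocation(input_tokens=2_000, output_tokens=500)] * 100
completed, spent = run_with_budget(looping_agent, budget_usd=0.10)
print(f"stopped after {completed} calls, spent ${spent:.4f}")
```

Once token counts are recorded on every span, the same arithmetic can run over real traces instead of a hand-written list.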
What we will build
We will iteratively develop a FastMCP server that exposes tools for a fun real-time data engineering scenario, instrument it with OpenTelemetry and visualise the resulting traces.
- Build an MCP server (understand the MCP request/response lifecycle).
- OpenTelemetry for agentic AI (traces, metrics and logs, and why traces are the primary signal for agents).
- Instrument your MCP server (OpenTelemetry instrumentation, see how errors are automatically recorded with stack traces).
- From traces to dashboards (build a dashboard that shows which tools are slowest, along with error rates and token costs).
- Production patterns and case studies (patterns for sensitive data handling, sampling strategies for high-throughput agent workflows).
- Connecting auth and observability (auth attributes appearing in traces when OAuth is enabled, giving per-user visibility).
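The dashboard questions in the list above (which tools are slowest, how often they fail, how many tokens they consume) come down to simple aggregations over span data. Here is a minimal sketch in plain Python. The span records and field names are hypothetical and do not follow any Jaeger or OTLP schema, and the tool names are invented for the island scenario.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exported spans; field names are illustrative only.
spans = [
    {"tool": "check_turnip_price", "duration_ms": 120, "error": False, "tokens": 350},
    {"tool": "check_turnip_price", "duration_ms": 180, "error": True, "tokens": 0},
    {"tool": "list_villagers", "duration_ms": 900, "error": False, "tokens": 1200},
    {"tool": "list_villagers", "duration_ms": 1100, "error": False, "tokens": 1400},
]


def summarise(spans):
    """Group spans by tool and compute latency, error rate and token totals."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["tool"]].append(span)

    summary = {}
    for tool, items in grouped.items():
        summary[tool] = {
            "calls": len(items),
            "mean_ms": mean(s["duration_ms"] for s in items),
            "error_rate": sum(s["error"] for s in items) / len(items),
            "tokens": sum(s["tokens"] for s in items),
        }
    return summary


summary = summarise(spans)
slowest = max(summary, key=lambda tool: summary[tool]["mean_ms"])
print("slowest tool:", slowest)
for tool, stats in summary.items():
    print(tool, stats)
```

An observability backend runs queries of this shape for you; the sketch only shows what the panels are computing.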
Target audience
Data engineers, data scientists, ML/AI engineers and SRE/platform engineers who are building or operating AI agents and need production visibility into agentic workflows. This is relevant to anyone deploying LLM-powered tools, multi-agent orchestration or MCP servers. Or perhaps you’re just a fan of Animal Crossing and social simulation games.
Prerequisites
- Basic to intermediate Python (comfortable with decorators, async/await basics and uv).
- No prior knowledge of MCP, OpenTelemetry or FastMCP is required.
Tutorial requirements
- macOS/Linux laptop, or Windows with PowerShell.
- Docker, Colima or OrbStack (to run Docker Compose for the local observability stack).
- uv for package management.
- A code editor (VS Code, Cursor, Kiro or similar).
- LLM access, either via a vendor (Anthropic, OpenAI, etc.) or a local Ollama install. With Ollama, we will be serving a 1B model, so you’ll need enough RAM and disk space.
- All materials will be distributed via a public GitHub repository with a README containing setup instructions and a Docker Compose file for the observability stack.
Key takeaways
- Understand why distributed tracing (rather than traditional metrics) is the primary observability signal for agentic AI systems.
- Be able to build an MCP server with custom tools using FastMCP and instrument it with OpenTelemetry.
- Know the OpenTelemetry GenAI and MCP semantic conventions and how they standardise telemetry across agent frameworks.
- Be able to visualise, query and dashboard agent traces using Jaeger.
- Understand the production observability landscape: auto-instrumentation libraries, sensitive data handling and compliance considerations.
Tun leads AI Engineering at Lenses, where he is focused on helping companies imagine and implement their strategic vision with agentic AI systems fuelled by real-time context. He was previously a Head of Data and Data/ML Engineer at high growth startups and has spent 20 years building data-intensive applications and leading T-shaped teams.
Tun is a co-organiser of the annual PyData London conference and co-founder of PyData Cornwall. He is a strong advocate in the Python AI engineering community and a contributor to open source AI engineering and Apache Kafka tools.
In his spare time, Tun goes surfing, plays guitar and shoots 35mm film.
Data Engineer in AI Platform at The Economist, PyData Cornwall co-founder, and committed diversity and inclusion ally.