2026-06-07, Doddington Forum
Most RAG demos stop at retrieval and summarisation. In practice, we also need to measure how well users and models understand the source material. This talk introduces a reusable evaluation pattern that turns any document into a live-graded “exam engine” using Python tools including Docling, DeepEval, and Marimo.
We will build a stateful application that generates multiple-choice and free-text questions from complex documents, creates realistic distractors, and scores answers in real time using an LLM-as-judge pipeline. The demo is intentionally playful, but each component maps to a production concern: layout-aware ingestion (tables and figures), synthetic QA dataset creation, semantic grading, and interactive evaluation loops.
Attendees will learn how to move beyond passive RAG towards systems that benchmark knowledge, support training workflows, and enable human-in-the-loop evaluation.
RAG systems typically answer questions but rarely evaluate whether the answer, or the user, actually demonstrates understanding. That requires structured datasets, grading logic, and application state, not just retrieval.
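As a rough illustration of those three ingredients, the sketch below models a structured dataset entry and the application state that accumulates grades. It uses only the standard library, and every name in it (`GoldenItem`, `QuizState`, `record`, `mean_score`) is hypothetical and chosen for this example rather than taken from the talk's code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GoldenItem:
    """One entry of the structured dataset: a question and its reference answer."""
    question: str
    expected_answer: str


@dataclass
class QuizState:
    """Application state that persists across questions (hypothetical design)."""
    items: list[GoldenItem]
    position: int = 0
    scores: list[float] = field(default_factory=list)

    @property
    def finished(self) -> bool:
        # True once every item has received a score.
        return self.position >= len(self.items)

    def record(self, score: float) -> None:
        """Store a score in [0, 1] for the current item and move to the next one."""
        if self.finished:
            raise IndexError("quiz already finished")
        if not 0.0 <= score <= 1.0:
            raise ValueError("score must be between 0 and 1")
        self.scores.append(score)
        self.position += 1

    @property
    def mean_score(self) -> float:
        # Average of recorded scores; 0.0 before anything has been graded.
        return sum(self.scores) / len(self.scores) if self.scores else 0.0
```

Keeping the grading logic outside the state object means the same session can be scored by exact matching or by an LLM judge without changing how progress is tracked.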
In this talk, we build a live-graded “knowledge arena”: a Python application that converts a dense technical document into an interactive quiz with two modes:
- Easy mode - automatically generated multiple-choice questions with plausible distractors
- Expert mode - free-text answers scored in real time using semantic LLM metrics
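To make Easy mode concrete, here is a minimal sketch of how a multiple-choice question might be assembled once a correct answer and candidate distractors already exist. How the distractors are generated is out of scope here. `MultipleChoice`, `build_multiple_choice`, and `grade_easy` are hypothetical helpers written for this example and are not part of DeepEval or Marimo.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MultipleChoice:
    question: str
    options: tuple[str, ...]
    correct_index: int


def build_multiple_choice(
    question: str,
    answer: str,
    distractors: list[str],
    rng: random.Random,
) -> MultipleChoice:
    """Combine the answer with distractors and shuffle their order."""
    # Drop distractors that repeat the answer or each other (case-insensitive),
    # so that exactly one option is correct.
    seen = {answer.strip().lower()}
    unique: list[str] = []
    for candidate in distractors:
        key = candidate.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(candidate)
    if not unique:
        raise ValueError("at least one distinct distractor is required")

    options = [answer, *unique]
    rng.shuffle(options)  # a seeded Random keeps the order reproducible
    return MultipleChoice(question, tuple(options), options.index(answer))


def grade_easy(item: MultipleChoice, chosen_index: int) -> float:
    """Easy mode is graded by exact match on the selected option."""
    return 1.0 if chosen_index == item.correct_index else 0.0
```

Passing the random generator in explicitly is a design choice for this sketch: a fixed seed makes option order repeatable in tests, while a fresh generator gives each quiz run a different order.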
The implementation illustrates several reusable production patterns:
- Document ingestion (Docling): Extracting layout, tables, and figures so evaluation covers the full source rather than plain text only.
- Synthetic dataset generation (DeepEval): Creating “golden” QA pairs and automated distractors for benchmarking and training.
- LLM-as-judge grading: Scoring free-text answers with semantic metrics instead of brittle string matching.
- Stateful Python UI (Marimo): Managing interaction and evaluation loops without custom JavaScript.
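The LLM-as-judge pattern from the list above can be sketched without committing to any particular library. The example below is not the DeepEval API: it assumes a hypothetical `call_llm` callable that takes a prompt and returns the model's text reply, plus an illustrative prompt format and a pass threshold of 0.7. `fake_llm` is an offline stand-in so the sketch runs without a model.

```python
import re
from typing import Callable

# Illustrative prompt; a real rubric would be tuned to the document and domain.
JUDGE_PROMPT = (
    "You are grading a quiz answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with a single line 'SCORE: <number between 0 and 1>'."
)

_SCORE_RE = re.compile(r"SCORE:\s*([0-9]*\.?[0-9]+)")


def judge_answer(
    question: str,
    reference: str,
    candidate: str,
    call_llm: Callable[[str], str],
    threshold: float = 0.7,
) -> tuple[float, bool]:
    """Ask the judge model for a score and return (score, passed)."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    reply = call_llm(prompt)

    # Judges do not always follow the requested format, so fail loudly
    # instead of silently assigning a grade.
    match = _SCORE_RE.search(reply)
    if match is None:
        raise ValueError(f"judge reply has no parsable score: {reply!r}")

    # Clamp to [0, 1] in case the model returns an out-of-range number.
    score = min(1.0, max(0.0, float(match.group(1))))
    return score, score >= threshold


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model call; always returns the same verdict."""
    return "SCORE: 0.85"
```

For example, `judge_answer("What does Docling extract?", "Layout, tables, and figures.", "Tables and figures.", fake_llm)` returns a score and a pass flag. Swapping `fake_llm` for a function that calls an actual model is the only change needed to grade for real.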
Although the interface is playful, the architecture generalises to production RAG and agentic knowledge systems for benchmarking, training, and human-in-the-loop evaluation.
This talk presents a reusable LLM-as-judge architecture for evaluating understanding in RAG systems using synthetic QA generation and real-time semantic grading in Python. All demo components are pre-built and run locally with cached models and datasets.
Audience / Prerequisites
- Intermediate Python users familiar with basic LLM and RAG concepts (embeddings, retrieval).
- No prior experience with Docling, DeepEval, or Marimo required.
Key Takeaways
- A reusable LLM-as-judge evaluation pattern for RAG
- How to generate QA benchmarks from documents automatically
- Techniques for handling tables and figures in ingestion
- Where live grading fits into production workflows
Adam is a Staff Data Scientist at ComplyAdvantage, where he tackles financial crime with advanced analytics, large-scale systems, and the latest in generative and agentic AI.
Before that, he spent eight years in the smart cities space at HAL24K, helping governments and infrastructure providers make better decisions with their data. Along the way, he built and led a team of ten data scientists and helped launch four spin-out ventures.
A recovering astrophysicist, Adam spent a decade analysing data from space telescopes in search of new cosmic phenomena. He’s since redirected that curiosity toward Earth-based problems.
Adam is an active member of the PyData community, the founder of PyData Southampton, and a long-time volunteer with DataKind UK, supporting charities and NGOs with pro-bono data science.