PyData London 2026

From Synthetic Examples to Production Signals: Multimodal Training Data Pipelines with Privacy-Safe Feedback
2026-06-05, Doddington Forum

Production AI systems improve through a data flywheel: teams create training examples from curated source material, those examples shape model behavior, production usage reveals what the model still needs, and those usage signals become the next round of improvement. This hands-on tutorial focuses on the data pipelines behind that flywheel: how to generate, validate, and anonymize training data without relying on one-off prompt scripts.

Participants will build a reproducible training-data pipeline using NVIDIA NeMo Data Designer and NeMo Anonymizer. We'll start by working through text-based examples that introduce the basics of Data Designer: defining the shape of a dataset, connecting generation to source records, creating structured outputs, and filtering generated rows with judge-based quality checks. Then we'll extend the same pattern to multimodal document understanding with image-based invoice data and VLM-generated visual QA examples.
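As a rough illustration of the generate-then-validate shape described above, here is a minimal plain-Python sketch. It does not use the Data Designer API: `QAExample`, `generate_example`, `judge_score`, and `build_dataset` are hypothetical names, and both the generator and the judge are deterministic stand-ins for what would be LLM calls in the workshop.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    """Structured output: every generated row has the same typed fields."""
    source_id: str
    question: str
    answer: str


def generate_example(record: dict) -> QAExample:
    """Stand-in for a templated LLM call that is tied to one source record."""
    return QAExample(
        source_id=record["id"],
        question=f"What is the total on invoice {record['id']}?",
        answer=record.get("total", ""),
    )


def judge_score(example: QAExample, record: dict) -> int:
    """Stand-in for an LLM judge: 1 if the answer matches the source, else 0."""
    return int(bool(example.answer) and example.answer == record.get("total"))


def build_dataset(records: list[dict], min_score: int = 1) -> list[QAExample]:
    """Generate one example per source record and keep rows the judge accepts."""
    kept = []
    for record in records:
        example = generate_example(record)
        if judge_score(example, record) >= min_score:
            kept.append(example)
    return kept


source_records = [
    {"id": "inv-001", "total": "42.50"},
    {"id": "inv-002"},  # no total, so the generated answer is empty
]
dataset = build_dataset(source_records)
```

The point of the sketch is the separation of concerns: generation is connected to a source record, the output has a fixed schema, and filtering is a separate step with an explicit threshold.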

Finally, we'll shift from workshop-generated data to production usage data from a fine-tuned model. Using Anonymizer, participants will detect and transform sensitive fields so usage logs can safely become source material for the next training iteration.
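To make the detect-and-transform idea concrete, here is a minimal sketch that replaces two kinds of sensitive values with typed placeholders. It is not the NeMo Anonymizer API: the `PATTERNS` table, the `anonymize` helper, and the sample log rows are assumptions for illustration, and the two regular expressions are far narrower than a real detector.

```python
import re

# Illustrative patterns only; a real detector covers many more entity types.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def anonymize(text: str) -> str:
    """Replace each detected value with a placeholder naming its entity type."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Made-up usage-log rows in a prompt/response shape.
usage_log = [
    {"prompt": "Email the invoice to jane.doe@example.com", "response": "Done."},
    {"prompt": "Call +44 20 7946 0958 about invoice 17", "response": "Noted."},
]
safe_log = [{key: anonymize(value) for key, value in row.items()} for row in usage_log]
```

Typed placeholders keep some training signal (the model still sees that an email or phone number was present) while removing the value itself.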

By the end, participants will understand a practical pattern for multimodal training data with privacy-safe feedback: source data -> generate -> validate -> anonymize feedback -> improve.


This tutorial is for AI builders who want more discipline around training-data creation. The central premise is simple: the data that models consume deserves the same engineering rigor as the models themselves.

Across three progressive Jupyter notebooks, participants will:
- Learn the Data Designer workflow through a text QA example, using explicit controls for the mix of examples, seed datasets, templated LLM generation, structured outputs, and LLM-as-a-judge quality checks.
- Apply the same pipeline shape to multimodal document data, using invoice images as source records, VLM-generated summaries and question-answer pairs, and a judge step to filter for correctness and visual grounding.
- Anonymize production-style usage data from a fine-tuned model, comparing privacy strategies that reduce sensitive-data risk while preserving useful training signal.
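The example-mix controls, seed datasets, and templated generation mentioned for the first notebook can be sketched in plain Python as follows. This is an assumption-laden illustration rather than Data Designer code: the `MIX` weights, the `TEMPLATE` wording, and the `build_prompts` helper are all hypothetical, and the prompts are only built here, not sent to a model.

```python
import random

# Hypothetical target mix: the share of each question type in the dataset.
MIX = {"factual": 0.6, "reasoning": 0.3, "unanswerable": 0.1}

TEMPLATE = (
    "Write one {question_type} question about the passage below.\n"
    "Passage: {passage}"
)


def build_prompts(seed_records, n, rng_seed=0):
    """Sample question types by weight and fill the template from seed rows."""
    rng = random.Random(rng_seed)  # fixed seed keeps the run reproducible
    types = rng.choices(list(MIX), weights=list(MIX.values()), k=n)
    prompts = []
    for i, question_type in enumerate(types):
        record = seed_records[i % len(seed_records)]  # cycle through seed rows
        prompts.append(
            {
                "question_type": question_type,
                "source_id": record["id"],
                "prompt": TEMPLATE.format(
                    question_type=question_type, passage=record["text"]
                ),
            }
        )
    return prompts


seed_records = [
    {"id": "doc-1", "text": "The invoice is due within 30 days."},
    {"id": "doc-2", "text": "Late payments incur a 2% fee."},
]
prompts = build_prompts(seed_records, n=10)
```

Keeping the mix in a single table and the randomness behind a fixed seed is what separates this from a one-off prompt script: the same inputs give the same set of prompts on every run.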

Participants leave with a working repo, runnable notebooks, and a reusable mental model for building training-data pipelines across text and images.

Takeaways
- A reproducible pattern for multimodal training-data generation.
- Practical use of source datasets, example-mix controls, dependency-aware columns, structured LLM outputs, and judge-based validation.
- A privacy workflow for turning production usage logs into safer source data for future training iterations.
- Hands-on experience with NeMo Data Designer and NeMo Anonymizer.
- A clear view of how synthetic generation, quality validation, and anonymized production feedback support a training-data lifecycle.

Why Attend This Session?
Most synthetic-data tutorials stop after generation. This session follows the full lifecycle: define the source material and the kinds of examples you want, generate text and multimodal training data, validate quality, anonymize production feedback, and prepare the anonymized data to be transformed into the next set of training examples.

Prerequisites
This is a hands-on notebook workshop. To follow along, please bring:
- A laptop where you can run Python and Jupyter notebooks.
- Basic comfort with Python, pandas-style dataframes, and editing notebook cells.
- Ability to clone a GitHub repository and run simple terminal commands. Setup instructions will include installing uv if you do not already have it.
- One hosted model API key configured in your environment or .env file. You can create a free NVIDIA_API_KEY at build.nvidia.com, or use OPENROUTER_API_KEY or OPENAI_API_KEY; if you use OpenRouter or OpenAI, any cost incurred during the session should be minimal.
- Internet access for calling hosted LLM APIs during the exercises.
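A quick way to check the API key prerequisite before the session is a short script like the one below. The three variable names come from the list above; the `find_api_key` helper and the order in which keys are preferred are assumptions, not part of the workshop repository.

```python
import os

# Key names come from the prerequisites; the preference order is an assumption.
KEY_NAMES = ("NVIDIA_API_KEY", "OPENROUTER_API_KEY", "OPENAI_API_KEY")


def find_api_key(env=None):
    """Return the name of the first key that is set and non-empty, else None."""
    env = os.environ if env is None else env
    for name in KEY_NAMES:
        if env.get(name):
            return name
    return None


found = find_api_key()
if found:
    print(f"Using {found}")
else:
    print("No API key found; set one of: " + ", ".join(KEY_NAMES))
```

This only inspects the process environment. If the key lives in a .env file, it needs to be loaded first, for example with a loader such as python-dotenv.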

The workshop repository URL and setup instructions will be shared before the session. You do not need prior experience with NeMo Data Designer or NeMo Anonymizer.

Research Scientist/Engineer at NVIDIA focused on Multimodal Synthetic Data Generation