PyData London 2026

Lipika Ramaswamy


Session

06-05
16:00
90min
From Synthetic Examples to Production Signals: Multimodal Training Data Pipelines with Privacy-Safe Feedback
Nabin Mulepati, Lipika Ramaswamy

Production AI systems improve through a data flywheel: teams create training examples from curated source material, those examples shape model behavior, production usage reveals what the model still needs, and those usage signals become the next round of improvement. This hands-on tutorial focuses on the data pipelines behind that flywheel: how to generate, validate, and anonymize training data without relying on one-off prompt scripts.

Participants will build a reproducible training-data pipeline using NVIDIA NeMo Data Designer and NeMo Anonymizer. We'll start with text-based examples that introduce the basics of Data Designer: defining the shape of a dataset, connecting generation to source records, creating structured outputs, and filtering generated rows with judge-based quality checks. Then we'll extend the same pattern to multimodal document understanding, using image-based invoice data and visual question-answering (QA) examples generated by a vision-language model (VLM).
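The generate-then-filter pattern described above can be sketched in plain Python. This is a minimal illustration only, not the NeMo Data Designer API: the `Example` class, the `generate`, `judge`, and `build_dataset` functions, the record fields, and the 0.5 threshold are all hypothetical stand-ins, and both model calls are replaced by deterministic stubs so the sketch runs without any external service.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """Shape of one generated training row (hypothetical schema)."""
    source_id: str
    question: str
    answer: str


def generate(record: dict) -> Example:
    """Stand-in for a model call: build a QA pair tied to one source record."""
    return Example(
        source_id=record["id"],
        question=f"What is the total on invoice {record['id']}?",
        answer=record.get("total", ""),
    )


def judge(example: Example) -> float:
    """Stand-in for an LLM judge: 1.0 if the answer is non-empty, else 0.0."""
    return 1.0 if example.answer.strip() else 0.0


def build_dataset(records: list[dict], threshold: float = 0.5) -> list[Example]:
    """Generate one candidate per source record, keep those the judge accepts."""
    candidates = [generate(r) for r in records]
    return [ex for ex in candidates if judge(ex) >= threshold]


# Hypothetical source records; the second has no total and should be filtered out.
records = [
    {"id": "INV-001", "total": "120.50"},
    {"id": "INV-002", "total": ""},
]
dataset = build_dataset(records)
```

In a real pipeline the two stubs would be replaced by model-backed generation and a judge prompt; the point of the sketch is that every generated row stays linked to its source record and must pass an explicit quality gate before it enters the dataset.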

Finally, we'll shift from workshop-generated data to production usage data from a fine-tuned model. Using NeMo Anonymizer, participants will detect and transform sensitive fields so usage logs can safely become source material for the next training iteration.
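As a rough illustration of the detect-and-transform step, the sketch below replaces email addresses and phone numbers in a usage log with placeholder tokens. It does not use or represent the NeMo Anonymizer API: the regular expressions, the `anonymize` function, the placeholder tokens, and the sample log entry are assumptions made for this example, and simple regexes like these will miss many real-world forms of sensitive data.

```python
import re

# Deliberately simple patterns (assumptions for illustration, not production-grade).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def anonymize(text: str) -> str:
    """Replace detected emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text


# Hypothetical usage-log entry from a deployed model.
log = {"prompt": "Send the invoice to jane.doe@example.com or call +44 20 7946 0958."}
clean = {field: anonymize(value) for field, value in log.items()}
```

Replacing values with typed placeholders, rather than deleting them, keeps the sentence structure intact, which matters if the cleaned logs are later used as source material for generation.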

By the end, participants will understand a practical pattern for multimodal training data with privacy-safe feedback: source data -> generate -> validate -> anonymize feedback -> improve.

Doddington Forum