Build Training and Evaluation Datasets That Actually Work: A Hands-On Synthetic Data Pipeline Workshop
Whether you're fine-tuning an LLM, building an evaluation benchmark, or generating domain-specific training data, the quality of your dataset directly determines the quality of your model. Yet most teams still create training and eval data through ad-hoc prompting — producing datasets that lack diversity, have no validation, and can't be reproduced or iterated on systematically.
In this hands-on tutorial, participants will build synthetic training and evaluation datasets from scratch using a declarative Python framework that unifies statistical sampling, LLM-based generation, and automated validation into a single pipeline designed for AI builders.
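To make the declarative style concrete (the framework itself is not named in this abstract, so every class and field below is invented purely for illustration), a dataset design might read roughly like this: a list of column specs, where some columns are statistical samplers, some are LLM-generated, and some are automated validators.

    # Hypothetical, self-contained sketch of a declarative column spec.
    # These dataclasses are stand-ins, not the workshop framework's API.
    from dataclasses import dataclass, field

    @dataclass
    class SamplerColumn:              # drawn from a fixed distribution
        name: str
        values: list
        weights: list | None = None

    @dataclass
    class LLMTextColumn:              # filled in by an LLM, templated on other columns
        name: str
        prompt: str                   # may reference {other_column} placeholders
        depends_on: list = field(default_factory=list)

    @dataclass
    class ValidatorColumn:            # automated quality gate over generated columns
        name: str
        check: str                    # e.g. "lint" or "judge_rubric"
        depends_on: list = field(default_factory=list)

    design = [
        SamplerColumn("topic", values=["sql", "regex", "pandas"]),
        SamplerColumn("difficulty", values=["easy", "medium", "hard"],
                      weights=[0.5, 0.3, 0.2]),
        LLMTextColumn("instruction", prompt="Write a {difficulty} {topic} task.",
                      depends_on=["topic", "difficulty"]),
        ValidatorColumn("quality", check="judge_rubric", depends_on=["instruction"]),
    ]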
Starting with controlled samplers to define task distributions and difficulty gradients, we'll progressively layer in LLM-generated instruction/response pairs that reference the sampled context, expression-based derived columns, code generation validated by linting, and LLM-as-judge quality scoring to rank training examples. By the end, participants will have built a complete, multi-strategy dataset — ready for model fine-tuning or evaluation — with dependency management, quality gates, and reproducible configuration, all in under 50 lines of Python.
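The last two layers are the quality gates. Stripped of any particular framework, they amount to something like the sketch below: a cheap structural check that rejects generated code which does not even parse, and a judge score used to rank the examples that survive (score_with_judge is a stub standing in for whatever judge model or endpoint you choose).

    # Framework-agnostic sketch of the two quality gates.
    import ast

    def passes_lint(code: str) -> bool:
        """Minimal lint-style gate: the generated snippet must at least parse."""
        try:
            ast.parse(code)
            return True
        except SyntaxError:
            return False

    def score_with_judge(example: dict) -> float:
        """Placeholder for an LLM-as-judge call returning a 0-1 quality score."""
        raise NotImplementedError("wire this to your judge model")

    def rank_examples(examples: list[dict]) -> list[dict]:
        """Drop unparseable code, then rank the rest by judge score."""
        kept = [ex for ex in examples if passes_lint(ex["code"])]
        for ex in kept:
            ex["judge_score"] = score_with_judge(ex)
        return sorted(kept, key=lambda ex: ex["judge_score"], reverse=True)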
Each exercise builds on the previous one, demonstrating how the framework's DAG-based execution engine automatically resolves column dependencies and parallelizes work. Participants will experience the rapid iteration loop of preview (fast, in-memory) and create (production-scale, disk-backed), learning to prototype data designs quickly and generate production training data confidently.
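As a rough picture of what DAG-based execution means here (again a conceptual sketch, not the framework's actual engine), the column dependencies induce layers: columns whose inputs are already available can be generated together, and a preview run executes this same plan over a handful of in-memory rows before a create run executes it at full scale to disk.

    # Conceptual sketch of resolving column dependencies into parallel layers,
    # using only the Python standard library.
    from graphlib import TopologicalSorter

    deps = {
        "topic": set(),                      # sampler, no dependencies
        "difficulty": set(),                 # sampler, no dependencies
        "instruction": {"topic", "difficulty"},
        "response": {"instruction"},
        "code": {"response"},
        "judge_score": {"instruction", "response", "code"},
    }

    ts = TopologicalSorter(deps)
    ts.prepare()
    while ts.is_active():
        layer = list(ts.get_ready())         # these columns can be generated in parallel
        print("generate together:", layer)
        ts.done(*layer)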