2026-06-07, Doddington Forum
Learn practical techniques for using LLMs to solve the data scarcity problem that plagues real-world ML projects. This talk demonstrates three production-ready approaches (synthetic generation, LoRA fine-tuning, and LLM-powered annotation) for augmenting training datasets when you have abundant data for common cases but almost nothing for edge cases or emerging categories. Using a food review classification scenario, you'll see how to generate high-quality training data, when each technique works best, and, critically, how to validate synthetic data to avoid amplifying errors. Perfect for practitioners facing the "we have 10k examples of X but zero for Y" problem.
Target Audience: Data scientists and ML engineers working on classification, NLP, or content moderation tasks who struggle with imbalanced or incomplete training datasets.
Takeaway: A decision framework for choosing between synthetic generation, fine-tuning, and LLM annotation, plus validation strategies to ensure data quality before retraining models.
Objective
Many machine learning teams struggle not because of model limitations, but because their datasets fail to cover rare classes, niche domains, or emerging user behavior. Traditional data augmentation techniques offer limited help for text, often producing surface-level variations without meaningful semantic diversity. This talk presents a practical framework for using large language models to augment NLP datasets.
Outline
- The Data Bottleneck: Why models trained on "standard" food language fail to generalize to "Molecular Gastronomy" or niche culinary terms.
- Three Complementary Techniques:
  - Synthetic Generation: Creating fully labeled examples for missing classes.
  - LoRA Adapters: Fine-tuning LLMs to control style and label consistency (e.g., matching a "Professional Critic" tone).
  - LLM Annotation: Labeling large volumes of messy, real-world text from social media or external scrapes.
- Validation Strategies: Addressing error amplification and bias through human agreement checks, self-consistency, and "LLM-as-a-judge" approaches.
- Measuring Impact: Evaluating downstream model performance via rare-class recall, calibration, and error distribution.
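As a rough illustration of the "Measuring Impact" item, the sketch below computes per-class recall and a simple expected calibration error (ECE) in plain Python. It is not taken from the talk: the function names, the equal-width binning, the class labels, and the toy predictions are all assumptions made for this example.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Recall for every class that appears in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return {label: hits[label] / totals[label] for label in totals}


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Toy data (hypothetical): "molecular" stands in for the rare class.
y_true = ["standard", "standard", "standard", "molecular", "molecular"]
y_pred = ["standard", "standard", "molecular", "molecular", "standard"]
recalls = per_class_recall(y_true, y_pred)

ece = expected_calibration_error(
    confidences=[0.95, 0.95, 0.65, 0.65],
    correct=[True, True, True, False],
)
```

Comparing these numbers before and after adding augmented data, on a held-out set of real examples only, is one way to check whether the rare class actually improved rather than just the aggregate score.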
Central Thesis and Takeaways
The session provides a decision framework for choosing between generation, fine-tuning, and annotation based on data availability and the need for style or tone. Attendees will walk away with strategies to ensure synthetic data quality before retraining their models.
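Two of the quality checks named in the outline, self-consistency and human agreement, can be sketched in a few lines. This is a minimal example under stated assumptions, not the speaker's implementation: `self_consistent_label`, `cohens_kappa`, the 0.8 agreement threshold, and the sample labels are hypothetical, and the repeated LLM outputs are assumed to have been collected already.

```python
from collections import Counter


def self_consistent_label(samples, min_agreement=0.8):
    """Return the majority label if enough repeated LLM samples agree, else None."""
    label, count = Counter(samples).most_common(1)[0]
    return label if count / len(samples) >= min_agreement else None


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected if both annotators labeled at random with their own rates.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used one identical label throughout.
    return (observed - expected) / (1 - expected)


# Hypothetical: five LLM samples for one review, then a small human spot check.
kept = self_consistent_label(["positive"] * 4 + ["negative"])
dropped = self_consistent_label(["positive", "positive", "negative"])

human = ["pos", "pos", "neg", "neg"]
llm = ["pos", "pos", "neg", "pos"]
kappa = cohens_kappa(human, llm)
```

Examples where `self_consistent_label` returns `None` can be discarded or routed to human review, and a low kappa on the spot-check sample suggests the LLM labels should not be used for retraining without further inspection.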
Background Knowledge Expected
Basic knowledge of Python and familiarity with machine learning workflows (training, labeling, and evaluation) is recommended.
I am a Senior Machine Learning Scientist at Monzo, where I focus on Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and sophisticated data augmentation strategies. With six years of experience specializing in Natural Language Processing (NLP), I have a proven track record of building scalable AI systems for high-stakes environments.
Prior to joining Monzo, I was a Machine Learning Engineer at Bumble, leading Trust and Safety initiatives by developing LLM-powered moderation pipelines to ensure platform safety at scale. I also worked as a Senior Data Scientist at ComplyAdvantage, where I applied NLP to financial crime detection, and as a consultant at Sia, focusing on complex question-answering tasks.
I am passionate about the intersection of LLM infrastructure and practical data engineering, specifically solving the "cold-start" problem for niche domains through synthetic data and rigorous validation frameworks.