PyCon UK 2025

No data, no problem: Synthetic Data using LLMs and Faker
2025-09-19 , Space 2

What do you do when you need to build data products, but don’t have any data? Learn how to generate complex synthetic data using Faker. We will generate typical user journeys through a website, creating a synthetic data stream that could be used for development, testing or modelling.


Ever had a project grind to a halt because the data you need contains personal information, is confidential, or is protected by sector regulations? Synthetic data generation offers a practical way past these blockers, allowing development and testing work to continue without compromising data protection requirements.

This talk focuses on two distinct approaches to synthetic data creation. First, we assess the capabilities and limitations of Large Language Models as data generators, evaluating their effectiveness in producing structured event data. We then compare them against Faker, an established Python library specifically designed for synthetic data generation.

We will apply these technologies to generate synthetic data that replicates website visitor behaviour patterns. Website analytics typically provide well-defined event structures and documented user journey patterns, making this an ideal candidate for synthetic data generation. The challenge lies in creating complex datasets that accurately represent the varied pathways users take through websites whilst maintaining realistic interaction patterns.

The project implements a workflow that accepts specifications for user journey patterns, event schemas, and controlled randomness parameters. The output is a configurable event stream generator that mimics user journeys through a website. Come learn how you can generate similar fake data yourself... because sometimes the best real data is the data you make yourself.

Outline
1. Understanding the use case
2. Can't AI do this?
3. Faker to the rescue
4. Faker Providers
5. How we used Faker


What level of experience do you expect from your audience for this session?

Intermediate

I lead the Engineering function at Tasman Analytics, a boutique data consultancy. We act as an interim/fractional data team and are passionate about helping clients leverage the power of their data.

Personally, I have a background in mechanical engineering and have worked across a range of sectors including sustainability, energy, property, construction and architecture. I am an engineer at heart and perennially look to hone the craft of engineering.