Constrained Data Synthesis
2019-09-04, 14:45–15:15, Track 2 (Baroja)

We introduce a method for creating synthetic data "to order" based on learned (or provided) constraints and data classifications. The method can synthesize both "good" and "bad" data.


Synthetic data is useful in many contexts, including

  • providing "safe", non-private alternatives to data containing personally identifiable information
  • software and pipeline testing
  • software and service development
  • enhancing datasets for machine learning.

Synthetic data is often created on a bespoke basis, and since the advent of generative adversarial networks (GANs) there has been considerable interest in, and experimentation with, using them as the basis for creating synthetic data.

We have taken a different approach. We have worked for some years on developing methods for automatically finding constraints that characterise data, and which can be used for testing data validity (so-called "test-driven data analysis", TDDA). Such constraints form (by design) a useful characterisation of the data from which they were generated. As a result, methods that generate datasets that match the constraints necessarily construct datasets that match many of the original characteristics of the data from which the constraints were extracted.
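To make the idea concrete, here is a minimal sketch of constraint discovery followed by constraint-driven synthesis. It is an illustration under stated assumptions, not the TDDA implementation: the function names (discover_constraints, satisfies, synthesize), the constraint kinds (min/max for numbers, allowed values for strings) and the sample records are all hypothetical, and uniform sampling is only one simple choice of generator.

```python
import random

def discover_constraints(rows):
    """Learn simple per-field constraints from example rows (hypothetical scheme)."""
    constraints = {}
    for field in rows[0]:
        values = [row[field] for row in rows]
        if all(isinstance(v, int) for v in values):
            constraints[field] = {"type": "int", "min": min(values), "max": max(values)}
        elif all(isinstance(v, (int, float)) for v in values):
            constraints[field] = {"type": "real", "min": min(values), "max": max(values)}
        else:
            constraints[field] = {"type": "string", "allowed": sorted(set(values))}
    return constraints

def satisfies(row, constraints):
    """Return True if the row meets every constraint."""
    for field, c in constraints.items():
        value = row[field]
        if c["type"] == "string":
            if value not in c["allowed"]:
                return False
        elif not (c["min"] <= value <= c["max"]):
            return False
    return True

def synthesize(constraints, n, seed=0):
    """Generate n rows that satisfy the constraints by sampling within them."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for field, c in constraints.items():
            if c["type"] == "int":
                row[field] = rng.randint(c["min"], c["max"])
            elif c["type"] == "real":
                row[field] = rng.uniform(c["min"], c["max"])
            else:
                row[field] = rng.choice(c["allowed"])
        rows.append(row)
    return rows

# Made-up example records, used only to drive the sketch.
sample = [
    {"age": 23, "height_m": 1.62, "status": "active"},
    {"age": 57, "height_m": 1.80, "status": "closed"},
    {"age": 41, "height_m": 1.75, "status": "active"},
]
constraints = discover_constraints(sample)
synthetic = synthesize(constraints, 50)
```

Every generated row passes the same validity check that the constraints define, which is the sense in which the synthetic data inherits characteristics of the original. Sampling each field independently ignores relationships between fields, so this sketch reproduces only per-field properties.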

An important aspect of datasets is the relationship between "good" (~ valid) and "bad" (~ invalid) data, both of which are typically present. Systems for creating useful, realistic synthetic data generally need to be able to synthesize both kinds, in realistic mixtures.
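One simple way to produce such a mixture is to generate valid rows and then deliberately break a constraint in a chosen fraction of them. The sketch below assumes hand-written constraints and hypothetical names (CONSTRAINTS, good_row, bad_row, synthesize_mixture); it is not a description of how the method presented in the talk generates bad data.

```python
import random

# Hypothetical constraints, written by hand for illustration only.
CONSTRAINTS = {
    "age": {"min": 0, "max": 120},
    "status": {"allowed": ["active", "closed"]},
}

def is_valid(row):
    """Return True if the row meets every constraint."""
    for field, c in CONSTRAINTS.items():
        value = row[field]
        if "allowed" in c:
            if value not in c["allowed"]:
                return False
        elif not (c["min"] <= value <= c["max"]):
            return False
    return True

def good_row(rng):
    """Sample a row that satisfies all constraints."""
    return {
        "age": rng.randint(CONSTRAINTS["age"]["min"], CONSTRAINTS["age"]["max"]),
        "status": rng.choice(CONSTRAINTS["status"]["allowed"]),
    }

def bad_row(rng):
    """Start from a good row, then violate exactly one constraint."""
    row = good_row(rng)
    field = rng.choice(sorted(CONSTRAINTS))
    c = CONSTRAINTS[field]
    if "allowed" in c:
        row[field] = "??"  # a value outside the allowed set
    else:
        row[field] = c["max"] + rng.randint(1, 50)  # beyond the maximum
    return row

def synthesize_mixture(n, bad_fraction, seed=0):
    """Return n (row, is_good) pairs with roughly bad_fraction invalid rows."""
    rng = random.Random(seed)
    n_bad = round(n * bad_fraction)
    rows = [(bad_row(rng), False) for _ in range(n_bad)]
    rows += [(good_row(rng), True) for _ in range(n - n_bad)]
    rng.shuffle(rows)
    return rows

mixture = synthesize_mixture(100, bad_fraction=0.1)
```

Keeping the good/bad label alongside each row means the mixture can be used to test a validation pipeline, since the expected outcome for every record is known in advance.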

This talk will discuss data synthesis from constraints, describing what has been achieved so far (which includes synthesizing good and bad data) and future research directions.


Domains – Big Data, Machine Learning, Simulation
Domain Expertise – some
Python Skill Level – professional
Project Homepage / Git – https://github.com/tdda
Abstract as a tweet – Creating good and bad synthetic data to order using constraints