Constrained Data Synthesis
2019-09-04 , Track 2 (Baroja)

We introduce a method for creating synthetic data "to order", based on learned (or provided) constraints and data classifications. The method can generate both "good" and "bad" data.


Synthetic data is useful in many contexts, including

  • providing "safe", non-private alternatives to data containing personally identifiable information
  • software and pipeline testing
  • software and service development
  • enhancing datasets for machine learning.

Synthetic data is often created on a bespoke basis, and since the advent of generative adversarial networks (GANs) there has been considerable interest in, and experimentation with, using them as the basis for creating synthetic data.

We have taken a different approach. For some years we have been developing methods for automatically finding constraints that characterise data and that can be used to test data validity (so-called "test-driven data analysis", TDDA). Such constraints form, by design, a useful characterisation of the data from which they were generated. As a result, any method that generates datasets satisfying the constraints necessarily reproduces many of the characteristics of the original data from which the constraints were extracted.
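
To make the constraint-discovery side concrete, here is a minimal sketch using the pandas interface of the open-source tdda package (the example dataframe is hypothetical, and exact calls may differ between versions):

    import pandas as pd
    from tdda.constraints import discover_df, verify_df

    # A small, hypothetical dataset standing in for real data.
    df = pd.DataFrame({
        'age': [23, 45, 31, 62, 28],
        'country': ['UK', 'US', 'UK', 'DE', 'UK'],
    })

    # Discover constraints (types, min/max, allowed values, nulls, ...)
    # that characterise this dataframe, and save them as a .tdda file.
    constraints = discover_df(df)
    with open('example.tdda', 'w') as f:
        f.write(constraints.to_json())

    # Later, or on new data: verify against the saved constraints.
    verification = verify_df(df, 'example.tdda')
    print(verification)    # per-constraint pass/failure summary

Any dataset satisfying the discovered constraints will, by construction, share the characteristics those constraints capture.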

An important aspect of datasets is the relationship between "good" (~ valid) and "bad" (~ invalid) data, both of which are typically present. Systems for creating useful, realistic synthetic data generally need to be able to synthesize both kinds, in realistic mixtures.
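
As a purely illustrative sketch (not the synthesis method presented in the talk), generating a controlled mixture of constraint-satisfying and constraint-violating rows from a couple of toy, hand-written constraints might look like this:

    import random

    # Toy, hand-written constraints in the spirit of TDDA's min/max and
    # allowed-values constraints (names and ranges are purely illustrative).
    AGE_MIN, AGE_MAX = 18, 90
    COUNTRIES = ['UK', 'US', 'DE']

    def good_row():
        """A row satisfying all constraints."""
        return {'age': random.randint(AGE_MIN, AGE_MAX),
                'country': random.choice(COUNTRIES)}

    def bad_row():
        """A row deliberately violating one randomly chosen constraint."""
        row = good_row()
        if random.random() < 0.5:
            row['age'] = AGE_MAX + random.randint(1, 50)   # out-of-range value
        else:
            row['country'] = 'XX'                          # disallowed category
        return row

    def synthesize(n, bad_fraction=0.1):
        """Generate a mixture of good and bad rows in a chosen proportion."""
        return [bad_row() if random.random() < bad_fraction else good_row()
                for _ in range(n)]

    print(synthesize(5, bad_fraction=0.2))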

This talk will discuss data synthesis from constraints, describing what has been achieved so far (including synthesizing both good and bad data) and outlining future research directions.


Project Homepage / Git:

https://github.com/tdda

Abstract as a tweet:

Creating good and bad synthetic data to order using constraints

Python Skill Level:

professional

Domain Expertise:

some

Domains:

Big Data, Machine Learning, Simulation

Nick is a practising data scientist with over 30 years' experience, from neural networks and genetic algorithms on parallel systems in the late 1980s, through parallel machine learning and 3D visualisation software as a founder of Quadstone from 1995, to novel modelling methods (e.g. uplift modelling) in the early 2000s. Since 2007, he has run the Edinburgh data science specialists Stochastic Solutions.

Nick enjoys using his deep knowledge of underlying algorithms to fashion tailored solutions to practical business problems for clients including Barclays, Sainsburys, T-Mobile and Skyscanner, and has a particular interest in testing and correctness in data science.