JuliaCon 2020 (times are in UTC)

Generating automatically labelled ML datasets with Lattice.jl
07-29, 13:50–14:00 (UTC), Red Track

Machine learning research relies heavily on curated data sets for experimentation, measuring performance, and communicating results. Labeling data is a labor most researchers would rather avoid, so the same standard data sets are constantly reused. Lattice.jl is our attempt to provide a much wider variety of machine learning datasets with little or no human effort.


Machine learning datasets are samples taken from a domain and an underlying data generating distribution. For example, the MNIST dataset is a sample from the set of all 28x28 grayscale images (the domain) that are created by humans writing the digits 0-9 (the data generating distribution). The examples in datasets are used to train models to perform machine learning tasks; for MNIST, the task is usually “what digit is this?” but the task could also be “is this digit odd?” or “was this digit written by a left-handed person?” Creating the label for each example--evaluating the task applied to that example--usually requires some sort of human effort.

Lattice.jl provides a framework for describing and abstracting domains, data generating distributions, and machine learning tasks for those problems where it is possible for a computer program to automatically generate examples. To explain the usefulness of such a framework, consider generating examples from an image domain. Since Lattice first generates a simple internal representation of the example and then uses the representation to draw the image, it is also possible to automatically create perfect labels for a wide variety of different tasks applied to each image. The labels and generating distributions can be manipulated to simulate many different kinds of labeling error and bias, and these problems can be introduced to completely different datasets in a systematic way.

This talk will provide a brief overview of the motivation and use cases for Lattice, and then show how the library relies on Julia’s type system to automatically pair abstract machine learning tasks with all of the domains where the tasks make sense. (For example, the abstract task “is X contained by Y?” can be instantiated and applied to images with at least two polygons, or images with bitmaps of a bird and a cage, but not to images of a single digit. In Lattice, pairs of domains and tasks don't need to be manually specified.) We’ll also show how multiple dispatch in Julia makes the implementation of Lattice--and the addition of new domains and tasks--remarkably simple. Finally, we'll share a few examples of using Lattice datasets with Flux.jl to study representation learning and hyperparameter optimization.

Don works at Oak Ridge National Lab on projects in artificial intelligence, high performance computing, remote sensing, and mathematics. He started using Julia the day it was first publicly announced.