JuliaCon 2023

Robust data management made simple: Introducing DataToolkit
2023-07-28 , 32-123

An ad-hoc approach to acquiring and using data can seem simple at first, but leaves one ill-prepared to deal with questions like: "where did the data come from?", "how was the data processed?", or "am I looking at the same data?". Generic tools for managing data (including some in Julia) exist, but suffer from limitations that reduce their broad utility. DataToolkit.jl provides a highly extensible and integrated approach, making robust and reproducible treatment of data and results convenient.


In this talk, I will discuss the DataToolkit*.jl family of packages, which aim
to enable end-to-end robust treatment of data. The small number of other
projects that attempt to tackle subsets of data management —DataLad, the Kedro
data catalogue, Snakemake, Nextflow, Intake, Pkg.jl's Artifacts, DataSets.jl—
have a collection of good ideas, but all fall short of the convenience and
robustness that is possible.

Poor data management practices are rampant. This has been particularly visible
with computational pipelining tools, and so most have rolled their own data
management systems (Snakemake, Nextflow, Kedro). These may work well when
building/running computational pipelines, but are harder to use interactively;
besides which, not everything is best expressed as a computational pipeline.

Scientific research is another area where robust data management is vitally
important, yet often doesn't follow one or more of the FAIR principles
(findable, accessible, interoperable, reusable). For this domain, a more general
approach is needed than is offered by computational pipeline tools. DataLad and
Intake both represent more general solutions, however both also fall short in
major ways. DataLad lacks any sort of declarative data file (embedding info as
JSON in git commit messages), and while Intake has a (YAML — see
https://noyaml.com for why this isn't a great choice) data file format it only
provides read-only data access, making it less helpful for writing/publishing
results.

There is space for a new tool, providing a better approach to data management.
Furthermore, there are strong arguments for building such a tool with Julia, not
least due to the strong project management capabilities of Pkg.jl and the
independence from system state provided by JLL packages. An analysis performed
in Julia, with the environment specified by a Manifest.toml, the data processing
captured within the same project, and the input data itself verified by
checksumming, should provide a strong reproducibility guarantees — beyond that
easily offered by existing tools. A data analysis project can be hermetic.

One of the aims of DataToolkit.jl is to make this not just possible, but easy to
achieve. This is done by providing a capable "Data CLI" for working with the
Data.toml file. Obtaining and using a dataset is as easy as:

data> create iris https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv
julia> sum(d"iris".sepal_length) # do something with the data

This exercise generates a Data.toml file that declaratively specifies the data
sets of a project, how they are obtained and verified, and even capture the
preprocessing required before the main analysis. DataToolkit.jl provides a
declarative data catalogue file, easy data access, and automatic validation;
enabling data management in accordance with the FAIR principles.

Pkg.jl's Artifact system also provides these reproducibility guarantees, but is
designed specifically for small blobs of data required by packages. This shows
in the design, with a number of crippling caveats for general usage, such as:
the maximum file size of 2GB, only working with libcurl downloads (i.e. no GCP
or S3 stored data, etc.), and being unextensible.

Even if DataToolkit.jl works well in some situations, to provide a truly general
solution it needs to be able to adapt to unforeseen use cases — such as custom
data formats/access methods, and integration with other tools. To this end,
DataToolkit.jl also contains a highly flexible plugin framework that allows for
the fundamental behaviours of the system to be drastically altered. Several
major components of DataToolkit.jl are in fact provided as plugins, such as the
default value, logging, and memorisation systems.

While there is some overlap between DataToolkit.jl and DataSets.jl,
DataToolkit.jl currently provides a superset of the capabilities (for instance
the ability to express composite data sets, built from multiple input data sets)
with more features in development; and the design of DataToolkitBase.jl allows
for much more to be built on it.

The current plan for the talk itself is:
- An overview of the problem domain and background
- A description of the DataToolkit.jl family of packages
- A demonstration or example of using DataToolkit.jl
- Showing what it takes to implement a new data loader/storage driver
- An overview of the plugin system
- A demonstration of the plugin system

See also: Presentation slides (1.5 MB)