DataSets.jl: A bridge between code and data
2021-07-29, Purple

In technical computing, getting data into and out of your code can be a pain. Data comes in all shapes, sizes and formats, with many different locations and storage access mechanisms.

DataSets.jl is a new package for describing data declaratively and mapping it neatly into your programs. We aim to make your code portable between data environments and remove the cruft of local paths and data access wrappers which litter technical analysis code.


DataSets.jl is an open-source package for describing data format and location declaratively, so that data deserialization and access can be kept separate from the domain-specific analysis code that consumes the data.

To quote from the package documentation (https://juliacomputing.github.io/DataSets.jl/dev):

DataSets.jl exists to help manage data and reduce the amount of data wrangling
code you need to write. It's annoying to constantly rewrite
* Command line wrappers which deal with paths to data storage
* Code to load and save from various data storage systems (eg, local
filesystem data; local git data, downloaders for remote data over various
protocols, cloud storage access)
* Code to load the same data model from various serializations
* Code to deal with data lifecycle; versions, provenance, etc

DataSets.jl provides scaffolding to make this kind of code more reusable. We want
to make it easy to relocate an algorithm between different data environments
without code changes. For example from your laptop to the cloud, to another
user's machine, or to an HPC system.
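
To make the idea of relocating an algorithm without code changes concrete, here is a minimal sketch of the underlying pattern. It does not use the DataSets.jl API. It relies only on Julia's standard TOML library, and the description layout (the `datasets` and `storage` tables, the `@__DIR__` placeholder) is an assumption loosely modeled on the data project files described in the package documentation. The names `DESCRIPTION`, `resolve_path` and `count_words` are hypothetical and exist only for this illustration.

```julia
using TOML  # standard library in Julia 1.6 and later

# Hypothetical declarative description of one dataset. The layout is an
# assumption for illustration, not a verified DataSets.jl schema.
const DESCRIPTION = """
[[datasets]]
name = "greeting"
description = "A small text file"

    [datasets.storage]
    driver = "FileSystem"
    type = "Blob"
    path = "@__DIR__/greeting.txt"
"""

# Hypothetical helper: look up a dataset by name and resolve its path
# relative to `dir`. Only a local-file driver is handled in this sketch.
function resolve_path(config::AbstractDict, name::AbstractString, dir::AbstractString)
    for ds in config["datasets"]
        ds["name"] == name || continue
        storage = ds["storage"]
        storage["driver"] == "FileSystem" ||
            error("unsupported driver: $(storage["driver"])")
        return replace(storage["path"], "@__DIR__" => dir)
    end
    error("no dataset named $name")
end

# Analysis code: it works on the loaded text and never sees a path.
count_words(text::AbstractString) = length(split(text))

# Demonstration in a temporary directory, which stands in for one
# "data environment".
mktempdir() do dir
    write(joinpath(dir, "greeting.txt"), "Hello from a data project")
    config = TOML.parse(DESCRIPTION)
    text = read(resolve_path(config, "greeting", dir), String)
    println(count_words(text))
end
```

In this sketch, moving the data means editing the description only. `count_words` and the call site that asks for "greeting" by name stay the same.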

I'm a long-time, enthusiastic user of Julia and enjoy contributing to packages across the open-source ecosystem, the Julia standard libraries and the compiler. I love hearing about people's fascinating technical computing adventures of all types! Find me at https://github.com/c42f