JuliaCon 2023

HDF5.jl: Hierarchical data storage for the Julia ecosystem
07-26, 12:00–12:30 (US/Eastern), 26-100

HDF5.jl is a Julia package for reading and writing data using the Hierarchical Data Format version 5 (HDF5) C library. HDF5 is a flexible, self-describing format suitable for storing complex scientific data, and is used as a container for many other formats.
This talk will give an overview of the HDF5 format and introduce basic usage of the HDF5.jl package through examples. We will highlight some recent features and discuss future plans for the package.

As the name suggests, the HDF5 format allows storing data in a hierarchical layout of groups and datasets. It is self-describing, meaning that type and dimension metadata are stored in the file, alongside any custom metadata, known as attributes. Its flexibility means that it is widely used in many scientific domains, and as a container format by other libraries, including NetCDF, JLD.jl/JLD2.jl, PyTables and MATLAB’s MAT-files.
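To make the layout concrete, here is a minimal sketch of a group, a dataset and an attribute written with HDF5.jl and then read back. The file name, the group name `simulation`, the dataset name `temperature` and the `units` attribute are all hypothetical, chosen only for illustration.

```julia
using HDF5

# Hypothetical file location, used for illustration only.
path = joinpath(mktempdir(), "example.h5")

h5open(path, "w") do file
    # Groups act like directories; datasets hold the array data.
    g = create_group(file, "simulation")
    g["temperature"] = rand(Float64, 4, 3)

    # Attributes attach custom metadata to a group or dataset.
    attributes(g["temperature"])["units"] = "K"
end

h5open(path, "r") do file
    dset = file["simulation/temperature"]
    # Element type and dimensions are recovered from the file itself.
    println(eltype(dset), " ", size(dset))
    println(read(attributes(dset)["units"]))
end
```

Nothing about the element type or shape is passed when reading: both come from the metadata stored in the file, which is what "self-describing" means here.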

HDF5.jl is a Julia package for accessing HDF5 files, using the HDF5 C library maintained by The HDF Group. It provides a simple, high-level interface that makes it easy to save and load data, as well as a more flexible interface that lets users take advantage of many of HDF5’s features. Although the HDF5.jl package has been around since 2012, we have recently made significant changes to improve its modularity and expose newer features.
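As a rough illustration of the high-level interface, the sketch below saves an array with `h5write` and loads it back with `h5read`, first in full and then as a slice. The file name and the dataset path `data/A` are assumptions made for this example.

```julia
using HDF5

# Hypothetical file location, used for illustration only.
path = joinpath(mktempdir(), "quick.h5")

A = collect(reshape(1:12, 3, 4))

# One-call write: creates the file and the intermediate group "data".
h5write(path, "data/A", A)

# One-call read of the whole dataset, then of a slice only.
B = h5read(path, "data/A")
C = h5read(path, "data/A", (2:3, 1:2))
```

For finer control, such as keeping a file open across many operations, the `h5open` form with a `do` block is the more flexible route.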

Some recent feature additions to the HDF5.jl package include:

  • A filter pipeline API that supports custom plugins, along with subpackages for several advanced compression filters.
  • Distributed reading and writing with MPI.jl.
  • Reentrant API locks to prevent errors when accessing from multiple threads.
  • Virtual datasets, which support storing data across multiple HDF5 files.
  • Direct access to remote files stored on AWS S3.

Finally, we will discuss future plans:

  • A path to thread-parallel I/O on HDF5 files, using the raw chunk API to access the byte-level layout of chunked datasets.
  • BinaryBuilder.jl-provided binaries across all supported platforms.

Simon Byrne is the lead software engineer on the CliMA project, which aims to build a next-generation climate model in Julia.

Dr. Mustafa Mohamad is an Assistant Professor in the Schulich School of Engineering at the University of Calgary. His main research expertise is in stochastic dynamical systems, uncertainty quantification, extreme event analysis, and data-driven methods in science and engineering.

Mark Kittisopikul, Ph.D. is a Software Engineer II at the HHMI Janelia Research Campus in Ashburn, Virginia, USA. He currently focuses on computing applications surrounding light-sheet microscopy. Previously, Dr. Kittisopikul completed postdoctoral and doctoral work in Cell Biology, Biophysics, and Systems Biology at Northwestern University and the University of Texas Southwestern Medical Center. He previously studied Biological Chemistry and Mathematics at the University of Chicago.

In his free time, Mark enjoys cycling, puzzle games, and playing with his daughter.