JuliaCon 2024

DiskArrays.jl for working with large disk-based nd-arrays
2024-07-10, Function (4.1)

DiskArrays.jl provides the AbstractDiskArray interface for chunked and compressed n-dimensional arrays with slow random access. Implementing the interface gives access to a wide set of indexing, view, reduction and broadcasting operations. Downstream packages can optimize parallel operations on these chunked arrays using the AbstractDiskArray interface.


Scientific data formats like NetCDF, HDF5 or Zarr allow storing large array data on local or network drives or in the cloud.
Julia packages for these data formats provide custom getindex/setindex methods on data structures that map to an array on disk.
However, implementing the full AbstractArray interface has proven to be difficult, since the interface is based on the assumption of efficient access to single array elements.
In many cases data are internally stored in compressed chunks, so that a naive implementation of the AbstractArray interface would lead to very poor performance for many IO operations.
DiskArrays.jl provides the AbstractDiskArray interface, for which backends only have to implement methods to read and write dense hyperrectangular subsets of the array for their data type.
In addition to read/write support, DiskArrays.jl provides an interface for querying the internal chunking structure of the disk-based array, which simplifies the implementation of efficient mapreduce and broadcast operations on these arrays.
Downstream packages can work with disk arrays to define chunk-based computations in a backend-agnostic way. We present an overview of how downstream packages like Rasters.jl, YAXArrays.jl or DiskArrayEngine.jl provide efficient and user-friendly ways to perform operations on larger-than-memory compressed arrays.
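The idea of deriving indexing from two block methods can be illustrated with a small, language-neutral sketch. The Python code below is not the DiskArrays.jl API: the names ToyBackend, BlockArray, read_block and write_block are hypothetical, and an in-memory list of lists stands in for a file on disk. It only shows how a generic layer might route every index expression through a backend that knows nothing but rectangular block reads and writes.

```python
class ToyBackend:
    """Hypothetical 2-D store that only supports rectangular block I/O."""

    def __init__(self, nrows, ncols):
        # An in-memory list of lists stands in for data on disk.
        self._data = [[0] * ncols for _ in range(nrows)]
        self.shape = (nrows, ncols)
        self.reads = 0  # number of block reads issued, to make I/O visible

    def read_block(self, rows, cols):
        """Return the sub-rectangle rows x cols as a list of lists."""
        self.reads += 1
        return [[self._data[r][c] for c in cols] for r in rows]

    def write_block(self, rows, cols, values):
        """Write a list of lists into the sub-rectangle rows x cols."""
        for i, r in enumerate(rows):
            for j, c in enumerate(cols):
                self._data[r][c] = values[i][j]


class BlockArray:
    """Generic layer: all indexing is derived from the two block methods."""

    def __init__(self, backend):
        self.backend = backend

    def _ranges(self, key):
        # Turn each int or slice into a range so the backend sees only ranges.
        return [
            range(k, k + 1) if isinstance(k, int) else range(*k.indices(n))
            for k, n in zip(key, self.backend.shape)
        ]

    def __getitem__(self, key):
        rows, cols = self._ranges(key)
        block = self.backend.read_block(rows, cols)
        # A scalar index is just a 1 x 1 block read.
        if all(isinstance(k, int) for k in key):
            return block[0][0]
        return block

    def __setitem__(self, key, values):
        rows, cols = self._ranges(key)
        if all(isinstance(k, int) for k in key):
            values = [[values]]
        self.backend.write_block(rows, cols, values)


arr = BlockArray(ToyBackend(4, 6))
arr[1:3, 2:5] = [[1, 2, 3], [4, 5, 6]]
print(arr[1:3, 2:5], arr[2, 4], arr.backend.reads)
```

In this sketch a single-element access costs a full block read, which is why element-wise loops over such an array are slow and why block-oriented access is the natural primitive for a backend to implement.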
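Why knowledge of the chunk layout matters for reductions can be shown with a second language-neutral sketch. Again this is Python with hypothetical names (ChunkedStore, each_chunk, read_block, chunked_sum, columnwise_sum), not the DiskArrays.jl API, and a counter of chunk loads stands in for the cost of reading and decompressing a chunk. The sketch assumes that any read touching a chunk has to load that chunk in full.

```python
from itertools import product


class ChunkedStore:
    """Hypothetical 2-D array stored as a grid of chunks; counts chunk loads."""

    def __init__(self, data, chunk_shape):
        self.shape = (len(data), len(data[0]))
        self.chunk_shape = chunk_shape
        self.loads = 0  # stands in for the cost of reading/decompressing
        self._chunks = {}
        for rows, cols in self.each_chunk():
            self._chunks[(rows.start, cols.start)] = [
                [data[r][c] for c in cols] for r in rows
            ]

    def each_chunk(self):
        """Yield (row_range, col_range) for every chunk in the grid."""
        (nr, nc), (cr, cc) = self.shape, self.chunk_shape
        for r0, c0 in product(range(0, nr, cr), range(0, nc, cc)):
            yield range(r0, min(r0 + cr, nr)), range(c0, min(c0 + cc, nc))

    def read_block(self, rows, cols):
        """Read a rectangle; every chunk it overlaps is loaded in full."""
        cr, cc = self.chunk_shape
        out = [[None] * len(cols) for _ in rows]
        for r0 in range(rows.start // cr * cr, rows.stop, cr):
            for c0 in range(cols.start // cc * cc, cols.stop, cc):
                chunk = self._chunks[(r0, c0)]
                self.loads += 1
                for r in range(max(r0, rows.start), min(r0 + cr, rows.stop)):
                    for c in range(max(c0, cols.start), min(c0 + cc, cols.stop)):
                        out[r - rows.start][c - cols.start] = chunk[r - r0][c - c0]
        return out


def chunked_sum(store):
    """Reduce chunk by chunk, so each chunk is loaded exactly once."""
    total = 0
    for rows, cols in store.each_chunk():
        total += sum(map(sum, store.read_block(rows, cols)))
    return total


def columnwise_sum(store):
    """Reduce one column at a time, ignoring the chunk layout."""
    nrows, ncols = store.shape
    total = 0
    for c in range(ncols):
        block = store.read_block(range(nrows), range(c, c + 1))
        total += sum(row[0] for row in block)
    return total


data = [[r * 6 + c for c in range(6)] for r in range(4)]
store = ChunkedStore(data, chunk_shape=(2, 3))
print(chunked_sum(store), store.loads)
store.loads = 0
print(columnwise_sum(store), store.loads)
```

For this 4 x 6 array with 2 x 3 chunks, both functions return the same sum, but the chunk-aligned version loads each of the 4 chunks once while the column-wise version triggers 12 chunk loads. A reduction that can query the chunk grid is able to choose the first access pattern automatically.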

Felix Cremer received his diploma in mathematics from the University of Leipzig in 2014. In 2016 he started his PhD studies on time series analysis of hypertemporal Sentinel-1 radar data.
He is interested in the use of irregular time series tools on Synthetic Aperture Radar data to derive more robust information from these data sets.
He worked on the development of deforestation mapping algorithms and on flood mapping in the Amazon using Sentinel-1 data.
He currently works at the Max-Planck-Institute for Biogeochemistry on the development of the JuliaDataCubes ecosystem in the scope of the NFDI4Earth project. The JuliaDataCubes organisation provides easy-to-use interfaces for working with multidimensional raster data.

I am a physicist by training, based at the Max-Planck-Institute for Biogeochemistry, Jena, Germany, where I currently study Global Biogeochemical Cycles in the Earth System using remote sensing, meteorological and other data sets.
My first commit to my first Julia package dates back to 2012, and since then I have authored and contributed to packages in the Julia geodata and processing ecosystem, for example NetCDF.jl, Zarr.jl, DiskArrays.jl, YAXArrays.jl, EarthDataLab.jl and others. Some may know me under my GitHub handle @meggart.
