This package offers efficient storage of large array snapshots during simulations, using ZFP compression for in-memory or out-of-core (file-based) storage. Both modes compress 2D to 4D arrays produced at intermediate steps of an iterative process, which can later be used for gradient calculation, plotting, and other applications. The in-memory compressor packs arrays into a byte vector, while the out-of-core compressor writes compressed array slices to one file per thread, performing well in heavy simulations.
It is common in numerical simulations to store the state variable, typically an array, at every iteration or at selected iterations. This is done for many reasons, such as restarting the simulation from a given point, analyzing the evolution of the state variable, or computing the gradient with the adjoint state method. In this work, we present a package that provides a simple interface to compress the arrays that compose a simulation's state using the ZFP compression algorithm.
Preserving a simulation's state is a well-established practice, often achieved through advanced techniques. In the context of implementing the adjoint state method, it is essential to store the state of the forward modelling function. This stored state is then utilized in each iteration of the adjoint modelling to compute the gradient. However, due to the extensive data involved – a product of the grid's spatial dimensions and the number of temporal iterations – storing every state in memory becomes impractical for large-scale models.
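For a rough sense of scale, a 512 × 512 × 512 grid of double-precision values saved at 1,000 time steps already amounts to 512³ × 1,000 × 8 B ≈ 1.1 TB, before counting any auxiliary fields.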
A more straightforward solution is to store every state on a data storage device, but this raises several other issues. The most obvious is that the device may simply lack the space to hold the whole collection of arrays. Given enough space, the device must also be fast enough not to increase the simulation time per iteration considerably. And even if it can store everything, there remains the question of how many write cycles the device can endure before it fails.
Compression trades computation time for a reduction in required storage, extending the storage device's lifespan and allowing it to hold more data. However, floating-point compression algorithms vary widely, and each exposes different parameters to consider.
We chose the ZFP algorithm (Lindstrom, 2014; Diffenderfer et al., 2019), through the ZfpCompression.jl wrapper, because it supports different levels of lossy compression with parametrizations that give control over throughput, space savings, and decompressed image quality. ZFP provides both lossy and lossless compression of floating-point arrays and targets numerical simulations. The main argument for its creation is that simulation data barely shrinks under lossless compression, as can be seen in Table 4 of Ratanaworabhan et al. (2006): floating-point values leave very little room for the deduplication techniques such algorithms commonly employ. Lossy compression, by contrast, can use thresholding/truncation techniques to reduce the array size, but it must keep the compression error small enough not to degrade the accuracy of the simulation. The ZFP algorithm lets users dictate how much accuracy, throughput, or precision their specific problem requires, and it is highly parallelizable. Thus, the package described in this paper uses it in the backend by leveraging the ZfpCompression.jl package.
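To make the modes concrete, below is a minimal sketch using ZfpCompression.jl, assuming a recent version of the wrapper in which the compressed stream carries a self-describing header (older versions instead require `zfp_decompress!` with a preallocated array and matching options):

```julia
using ZfpCompression

A = rand(Float32, 256, 256)      # one 2D snapshot of a simulation state

# Fixed-accuracy mode: absolute error bounded by `tol`
c = zfp_compress(A; tol=1e-3)    # returns a Vector{UInt8}

# The other (mutually exclusive) lossy modes:
# zfp_compress(A; precision=12)  # fixed precision: bit planes kept per value
# zfp_compress(A; rate=8)        # fixed rate: bits per value, predictable size

A2 = zfp_decompress(c)           # the header restores size, eltype and mode
@assert maximum(abs.(A .- A2)) <= 1e-3  # error stays within the tolerance
```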
The package SequentialCompression.jl described in this report can compress a collection of arrays fed to a data structure sequentially, and decompress them in any order through simple linear indexing. It provides two compressors: one compresses whole arrays and appends them to a long binary vector in memory; the other splits each array along its last dimension and writes each compressed piece to one or several data storage devices, with one file per CPU thread. The former will be called the ''in-memory'' compressor and the latter the ''multifile'' compressor. The package aims to provide a simple interface to compress and access these arrays, as the sketch below illustrates.
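The intended usage pattern is sequential appends during the simulation loop followed by random access afterwards. The sketch below is hypothetical: the constructor name `SeqCompressor`, the `append!` method, and integer indexing are assumptions inferred from the description above, not a verified API:

```julia
using SequentialCompression

# Hypothetical constructor: element type plus the spatial size of one snapshot
cp = SeqCompressor(Float64, 1000, 1000)

for it in 1:500                 # the simulation's time loop
    state = rand(1000, 1000)    # stand-in for the current state array
    append!(cp, state)          # compress and store this snapshot in sequence
end

snap = cp[42]                   # decompress any stored snapshot by its index
```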
The interface of the described package provides much of its functionality through optional keyword arguments to the constructor. Users can choose which compressor to use and supply a prediction of the number of arrays to be compressed, which lets the in-memory compressor be optimized. For the multifile compressor, users can specify the folder that will hold all the compressed array slice files, or several folders on different devices to maximize bandwidth.
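Under the same caveat, these keyword options might look as follows (the names `inmemory`, `expected_length`, and `paths` are illustrative assumptions, not the package's documented keywords):

```julia
# In-memory compressor, with a hint of how many snapshots to expect
cp_mem = SeqCompressor(Float64, 1000, 1000;
                       inmemory=true, expected_length=500)

# Multifile compressor, spreading slice files over two devices for bandwidth
cp_disk = SeqCompressor(Float64, 1000, 1000;
                        inmemory=false,
                        paths=["/mnt/ssd0/snaps", "/mnt/ssd1/snaps"])
```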
The purpose of this package is to facilitate lossy compression in iterative processes such as physical simulations, where arrays of the same size are produced at each iteration. Depending on the number of threads, the multifile compressor's throughput falls between that of writing the raw array to disk and that of storing it in a larger, higher-dimensional in-memory array. In summary, it has the functionality and performance to be easily integrated into simulation research workflows.
I'm currently a PhD student in Physics applied to Geophysics in the Signal Analysis and Imaging Group at the University of Alberta. Both my bachelor's and master's degrees are in Geophysics, from the Universidade Federal da Bahia, in Brazil. My interests are HPC, seismic imaging and inversion, and Julia!
In 1988, I earned a Geophysics degree from the National University of La Plata, Argentina, and moved to Canada in 1993 for graduate studies. I completed my Ph.D. in Geophysics at the University of British Columbia in 1996, followed by a postdoctoral fellowship. I then joined the University of Alberta as a faculty member in the Physics Department, becoming a tenured professor in 2001 and a full professor in 2007. I served as Chair of Physics from 2010 to 2015 and was re-appointed in 2016, concluding my second term in June 2021. At Alberta, I established an applied research Geophysics group and supervised around 60 graduate students.