PyConDE & PyData Berlin 2024

Sanket Verma

Sanket is a data scientist based out of New Delhi, India. He likes to build data science tools and products and has worked with startups, governments, and organisations. He loves building community and bringing everyone together and is Chair of PyData Delhi and PyData Global.

Currently, he looks after the community and open-source software (OSS) at Zarr as its Community Manager.

When he’s not working, he likes to play the violin and computer games and sometimes thinks of saving the world!


X / Twitter handle

@msankeys963

GitHub

https://github.com/MSanKeys963/

LinkedIn

http://linkedin.com/in/msankeys963


Session

04-23
14:10
30min
Exploring Zarr: From Fundamentals to Version 3.0 and Beyond
Sanket Verma

A key feature of the Python data ecosystem is its reliance on simple but efficient primitives that follow well-defined interfaces so that tools work seamlessly together (cf. http://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelisation of tensor access. Xarray provides metadata linking tensor dimensions. Zarr provides a missing feature: scalable, persistent storage for annotated hierarchies of tensors. Defined through a community process, the Zarr specification enables the storage of large out-of-memory datasets locally and in the cloud. Implementations exist in C++, C, Java, JavaScript, Julia, and Python.

This talk presents a systematic approach to understanding the newer Zarr Specification Version 3 by explaining the critical design updates, the performance improvements, and the lessons learned from the specification's broader adoption across the scientific ecosystem.

I will also briefly discuss the evolution of Zarr: the development of the Zarr Enhancement Process (ZEP) and its use to define the next major version of the specification (V3), as well as the uptake of the format across the research landscape.

PyData: Data Handling & Engineering
A1