PyConDE & PyData Berlin 2024

Exploring Zarr: From Fundamentals to Version 3.0 and Beyond
2024-04-23 , A1

A key feature of the Python data ecosystem is its reliance on simple but efficient primitives that follow well-defined interfaces so that tools work seamlessly together (cf. http://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelised tensor access. Xarray provides metadata linking tensor dimensions. Zarr provides a missing piece: scalable, persistent storage for annotated hierarchies of tensors. Defined through a community process, the Zarr specification enables the storage of large out-of-memory datasets locally and in the cloud. Implementations exist in C++, C, Java, JavaScript, Julia, and Python.

This talk presents a systematic approach to understanding the newer Zarr Specification Version 3 by explaining the critical design updates, performance improvements, and the lessons learned via the broader specification adoption across the scientific ecosystem.

I will also briefly discuss the evolution of Zarr: the development of the Zarr Enhancement Proposal (ZEP) process and its use to define the next major version of the specification (V3), as well as the uptake of the format across the research landscape.


Zarr is a data format for storing chunked, compressed N-dimensional arrays, and is a NumFOCUS-sponsored project.

It is based on an open technical specification and has implementations in several languages, with Zarr-Python being the most widely used.
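The chunked, compressed storage model can be illustrated with a small, self-contained sketch using only NumPy and zlib (no Zarr required). Each chunk of the array is compressed independently and stored under a key derived from its position in the chunk grid, which is essentially what a Zarr store does; the key format shown matches Zarr V2's "0.0"-style chunk keys, and the snippet is a conceptual illustration rather than the library's actual implementation.

```python
import zlib
import numpy as np

# A 4x4 array split into 2x2 chunks; each chunk is compressed
# independently and stored under a key derived from its grid position
# (Zarr V2 uses keys like "0.0", "0.1", ...).
data = np.arange(16, dtype="i4").reshape(4, 4)
chunks = (2, 2)

store = {}  # a Zarr "store" is conceptually just a key -> bytes mapping
for i in range(data.shape[0] // chunks[0]):
    for j in range(data.shape[1] // chunks[1]):
        block = data[i * chunks[0]:(i + 1) * chunks[0],
                     j * chunks[1]:(j + 1) * chunks[1]]
        store[f"{i}.{j}"] = zlib.compress(block.tobytes())

# Reading back a single chunk touches only one key -- this is what
# makes the format scale well on cloud object stores.
raw = zlib.decompress(store["1.0"])
chunk = np.frombuffer(raw, dtype="i4").reshape(chunks)
print(sorted(store))  # ['0.0', '0.1', '1.0', '1.1']
print(chunk)          # [[ 8  9]
                      #  [12 13]]
```

Because each key is an independent object, readers can fetch exactly the chunks they need in parallel, and writers can update one chunk without rewriting the whole array.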

Outline

First, I’ll talk about:

Understanding Zarr basics (5 mins.)

  • What is Zarr, and how does it work?
    • The inner workings of Zarr, using illustrated graphics
  • What is the Zarr Specification?
    • How does Zarr differ from other storage formats?

Then, I'll be talking about the new Zarr Specification V3 and its significant features:

What's new in Zarr Spec V3? (15 mins.)

  • What is the motivation for the evolution of the specification?
    • High-latency storage → Better support for storage technologies with relatively high per-operation latency, such as cloud object stores
    • Interoperability → A language-agnostic approach: the specification was slimmed down to achieve interoperability across major programming languages
  • Major design updates
    • Greater flexibility in how groups and arrays are created
      • Support for implicit groups that do not have a metadata document but whose existence is implied by descendant nodes
    • Restructuring of the JSON metadata document and storage path in both arrays and groups
      • Why is the Zarr V3 metadata consolidated compared to the Zarr V2 metadata?
    • Explicit support for extensions via defined extension points and mechanisms
      • How do extensions allow the community to add innovative and cutting-edge features to help their specific use cases?
    • Chunk encoding and supported codecs for V3
      • How are chunks encoded into a binary representation for storage, using the chain of codecs specified by the codecs metadata field?
  • ZEP Process
    • Need and origin of a community feedback process for the evolution of Zarr specification
    • Transformation from a steering-council-governed to a community-owned specification
    • Lessons learned when migrating from Spec V2 to Spec V3
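To make the metadata restructuring concrete, here is a sketch of what a single V3 array metadata document (the per-node zarr.json that replaces V2's separate .zarray/.zattrs files) could look like. The field names follow the V3 spec, but the values and the particular codec chain shown here are illustrative assumptions; consult the specification for the authoritative field set.

```python
import json

# A simplified sketch of a V3 array metadata document ("zarr.json").
# In V2 this information was split across .zarray and .zattrs files;
# V3 consolidates it into a single document per node. Values are
# illustrative -- see the V3 spec for the authoritative definitions.
metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [10000, 10000],
    "data_type": "int32",
    "chunk_grid": {
        "name": "regular",
        "configuration": {"chunk_shape": [1000, 1000]},
    },
    "chunk_key_encoding": {"name": "default"},
    "fill_value": 0,
    # Chunks are encoded by applying this codec chain in order:
    # serialise the array to bytes, then gzip-compress the result.
    "codecs": [
        {"name": "bytes", "configuration": {"endian": "little"}},
        {"name": "gzip", "configuration": {"level": 5}},
    ],
}

print(json.dumps(metadata, indent=2))
```

Having one self-describing document per node means a reader can learn everything about an array (shape, dtype, chunking, codec chain) from a single fetch, which matters on high-latency stores.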

Then, I’ll run a hands-on session covering the following:

Hands-on (5 mins.)

  • Creating Zarr arrays and groups using Zarr-Python V3.0
    • Walk-through of the new features mentioned above
  • Demo of Sharding Codec extension
    • Creating a sharded array and group and showing how a large number of chunks can be grouped together into a single shard
  • Looking under the hood
    • Using store functions to inspect how your Zarr data is stored
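As a rough preview of the sharding idea, the mapping from a chunk's grid coordinates to the shard that contains it is plain integer arithmetic. This is a conceptual sketch only; the actual sharding codec additionally maintains a per-shard index recording each chunk's offset and length so that individual chunks remain readable.

```python
# Sharding groups a block of chunks into a single storage object
# (a "shard"). Which shard a chunk lands in, and where it sits inside
# that shard, follows from integer division and remainder of its
# chunk-grid coordinates. Conceptual sketch, not the codec itself.
chunks_per_shard = (4, 4)  # assume each shard holds a 4x4 block of chunks

def shard_of(chunk_coord):
    """Return (shard coordinate, chunk position within the shard)."""
    shard = tuple(c // s for c, s in zip(chunk_coord, chunks_per_shard))
    within = tuple(c % s for c, s in zip(chunk_coord, chunks_per_shard))
    return shard, within

# Chunk (5, 2) lives in shard (1, 0), at position (1, 2) within it.
print(shard_of((5, 2)))  # ((1, 0), (1, 2))
```

Grouping many chunks into one shard cuts the number of objects the store has to manage without giving up chunk-level reads, since the shard's index records where each chunk starts.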

I'll close the talk with:

Conclusion (5 mins.)

  • Key takeaways
  • How can you get involved?
  • Q&A

This talk is aimed at an audience that works with large amounts of data and is looking for a transparent, open-source, reliable, cloud-optimised, and environmentally friendly format. I’d also like to invite anyone interested in the lessons I’ve learned maintaining the project over the years.

The tone of the talk will be informative, story-driven, and fun.

Attendees will need intermediate knowledge of Python and NumPy arrays to get the most out of this talk.

After this talk, you’ll:

  • understand the basics of Zarr and what's new in V3,
  • be able to use Zarr V3 for local and cloud storage,
  • be able to make an informed decision on which data format to use for your data

and you’ll also:

  • know why you should have a process for your project,
  • have essential takeaways on how an OSS project transitions from a young to a mature stage

Expected audience expertise: Domain:

Intermediate

Expected audience expertise: Python:

Novice

Abstract as a tweet (X) or toot (Mastodon):

Hi all! I’ll discuss Zarr, an open-source data format for storing chunked, compressed N-dimensional arrays. We’ll explore the Zarr ecosystem from fundamentals to V3.0 and beyond. If you’re interested in storing massive datasets, please attend my talk. Thanks!

Sanket is a data scientist based in New Delhi, India. He likes building data science tools and products and has worked with startups, governments, and other organisations. He loves building communities and bringing people together, and is the Chair of PyData Delhi and PyData Global.

Currently, he takes care of the community and OSS at Zarr as its Community Manager.

When he’s not working, he likes to play the violin and computer games and sometimes thinks of saving the world!