The Beauty of Zarr PyCon DE & PyData Berlin 2023

The Beauty of Zarr
2023-04-19 14:35–15:05, B05-B06

In this talk, I will introduce Zarr, an open-source data format for storing chunked, compressed N-dimensional arrays. The talk takes a systematic approach to understanding and using Zarr: how it works, why you would use it, and a hands-on session at the end. Zarr is based on an open technical specification, which makes implementations in several languages possible. I will mainly cover Zarr’s Python implementation and show how well it interoperates with existing libraries in the PyData stack.

Zarr is a data format for storing chunked, compressed N-dimensional arrays. It is based on an open technical specification and has implementations in several languages, with Zarr-Python being the most widely used. Zarr is a NumFOCUS sponsored project.
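To make "chunked, compressed" concrete, here is a toy sketch of the idea using only NumPy and zlib. It is not Zarr's actual implementation: the helper names `write_chunked` and `read_chunk`, the dictionary store, and the `row.col` key scheme are assumptions chosen for illustration, and the sketch assumes the array shape divides evenly into chunks.

```python
import zlib

import numpy as np


def write_chunked(array, chunk_shape):
    """Split a 2-D array into chunks and compress each chunk separately.

    Hypothetical helper for illustration; returns a dict of key -> bytes.
    """
    store = {}
    rows, cols = chunk_shape
    for i in range(0, array.shape[0], rows):
        for j in range(0, array.shape[1], cols):
            chunk = np.ascontiguousarray(array[i:i + rows, j:j + cols])
            key = f"{i // rows}.{j // cols}"  # chunk grid position
            store[key] = zlib.compress(chunk.tobytes())
    return store


def read_chunk(store, key, chunk_shape, dtype):
    """Decompress a single chunk without touching any other chunk."""
    raw = zlib.decompress(store[key])
    return np.frombuffer(raw, dtype=dtype).reshape(chunk_shape)


data = np.arange(16, dtype="i4").reshape(4, 4)
store = write_chunked(data, (2, 2))

print(sorted(store))                          # four independent chunk keys
print(read_chunk(store, "1.0", (2, 2), "i4"))  # lower-left 2 x 2 block
```

Because every chunk is stored under its own key, a reader can fetch and decompress only the chunks it needs, and the store can be anything that maps keys to bytes.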

Outline:

First, I will talk about:

The what, why, and how of Zarr (15 mins.)

How does Zarr work?
- Talking about the motivation and functionality of Zarr
What’s the need for using Zarr?
- When, where and why to use it?
Pluggable compressors and file storage
- Talking about several compressors and file-storage systems available in Zarr
Managing (selection, resizing, writing, reading) chunked arrays using Zarr functions
- Using built-in functions to manage compressed chunks
How is Zarr different when compared to other storage formats?
- Talking briefly about the technical specification, which allows Zarr to have implementations in several languages
- Pros and cons when compared to other storage formats
Zarr community
- What is the Zarr community, and how do we do things?

Then, I will run a hands-on session covering the following:

Hands-on (10 mins.)

Creating and using Zarr arrays
- Using built-in functions to create Zarr arrays and to read and write data
Looking under the hood
- Using store functions to explain how your Zarr data is stored
Consolidating metadata
- Consolidating the metadata for an entire group into a single object
Writing and reading from Cloud object storage
- Using S3/GCS/Azure to create Zarr arrays and write data to them
Showing how Zarr interoperates with the PyData stack
- How Zarr interoperates with the PyData stack (NumPy, Dask and Xarray) and how you can write data to your Zarr chunks in parallel at high speed using Dask

I will close the talk with:

Conclusion (5 mins.)

Key takeaway
How can you contribute to Zarr?
Q&A

This talk is aimed at people who work with large amounts of data and are looking for a data format that is transparent, easy to use and friendly to the environment. Zarr is also used in the bioimaging, geospatial and research communities. So, if you are from a community or an organisation dealing with high-volume data, Zarr may be the solution for you. Anyone who is curious and wants to learn about Zarr and how to use it is also most welcome.

The talk is informative in tone and includes a hands-on session. I’m also happy to adjust the style according to the audience in the room.

Attendees need intermediate knowledge of Python and NumPy arrays to follow this talk.

After this talk, you will:

Know the basic use cases for Zarr and how to use it
Understand the basics of data storage in Zarr
Understand the basics of compressors and file-storage systems in Zarr
Be able to make a better-informed decision about which data format to use for your data

Expected audience expertise: Domain: Novice
Expected audience expertise: Python: Intermediate
Abstract as a tweet:

Hi all, I’ll be talking about Zarr, an open-source data format for storing chunked, compressed N-dimensional arrays, along with a hands-on session. If you work with huge datasets in local/cloud storage and are looking for an efficient format, please attend my talk. Thanks!

Sanket Verma

Sanket is a data scientist based out of New Delhi, India. He likes to build data science tools and products and has worked with startups, government and organisations. He loves building community and bringing everyone together and is Chair of PyData Delhi and PyData Global. Currently, he's taking care of the community and OSS at Zarr as their Community Manager.
When he’s not working, he likes to play the violin and computer games and sometimes thinks of saving the world!