2025-04-23 – Palladium
A key feature of the Python data ecosystem is its reliance on simple but efficient primitives that follow well-defined interfaces so that tools work seamlessly together (cf. http://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelised tensor access. Xarray links tensor dimensions with metadata. Zarr provides a missing piece: scalable, persistent storage for annotated hierarchies of tensors. Defined through a community process, the Zarr specification enables the storage of large out-of-memory datasets locally and in the cloud. Implementations exist in C++, C, Java, JavaScript, Julia, and Python, enabling the same data to be read and written across languages and platforms.
This talk presents a systematic approach to understanding and adopting Zarr-Python 3, the new major version of Zarr-Python, by walking through the new API, the new storage backend, the improved codec pipeline, deprecations, and more.
I will also show the performance improvements in Zarr-Python 3 when creating, reading, and writing Zarr arrays asynchronously, across both local and remote storage such as AWS S3.
Zarr is a data format for storing chunked, compressed N-dimensional arrays, and the project is fiscally sponsored by NumFOCUS.
It is based on an open technical specification and has implementations in several languages, with Zarr-Python being the most widely used.
After the successful adoption of Specification V3, our team has worked tirelessly over the last year to bring the Python library into compliance with the latest spec.
Outline
First, I'll talk about:
Understanding Zarr basics (5 mins.)
- What is Zarr, and how does it work?
- The inner workings of Zarr using illustrated graphics
- What is the Zarr Specification?
- What's new in Zarr Spec V3?
Then, I'll be talking about the new Zarr-Python 3 and its significant features:
What's new in Zarr-Python 3? (15 mins.)
- Major design updates
- New storage backend
- Creating Zarr arrays and groups asynchronously
- New and improved codec pipeline
- Native GPU support for creating and writing arrays
- Changes and deprecations
- Overview of the new API
- Optimising performance for large arrays
- Deprecation of several stores like LMDBStore, SQLStore, MongoDBStore, etc.
- 3.0 Migration guide
- Steps to migrate from Zarr-Python 2 to Zarr-Python 3
- Extensions
- How can Zarr-Python 3 be extended to add new custom data types, stores, chunking strategies, etc.?
Then, I'll run a hands-on session covering the following:
Hands-on (5 mins.)
- Creating Zarr arrays and groups using Zarr-Python 3
- Plus walkthrough of the new features (mentioned above)
- Writing and reading from Cloud object storage
- Using S3/GCS/Azure to create Zarr arrays and write data to them
- Looking under the hood
- Using the store and info attributes to see how your Zarr data is stored and display key information
Conclusion (5 mins.)
- Key takeaways
- How can you get involved?
- QnA
This talk is aimed at an audience that works with large amounts of data and is looking for a transparent, open-source, reliable, cloud-optimised, and environmentally friendly format.
The tone of the talk will be informative, story-driven, and fun.
Intermediate knowledge of Python and NumPy arrays is recommended for attendees.
After this talk, you'll be able to:
- understand the basics of Zarr and what's new in V3,
- leverage the new functionality of Zarr-Python 3 with improved performance, and
- make an informed decision about which data format to use for your data.
Intermediate
Expected audience expertise: Python: Novice
Sanket is a data scientist based in New Delhi, India. He likes to build data science tools and products and has worked with startups, governments, and other organisations. He loves building community and bringing people together, and is Chair of PyData Delhi and PyData Global.
Currently, he's taking care of the community and OSS at Zarr as their Community Manager.
When he’s not working, he likes to play the violin and computer games and sometimes thinks of saving the world!