2025-04-23 – Palladium
A key feature of the Python data ecosystem is its reliance on simple but efficient primitives that follow well-defined interfaces so that tools work seamlessly together (cf. http://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelised tensor access. Xarray links tensor dimensions with metadata. Zarr provides a missing piece: scalable, persistent storage for annotated hierarchies of tensors. Defined through a community process, the Zarr specification enables the storage of large out-of-memory datasets locally and in the cloud. Implementations exist in C++, C, Java, JavaScript, Julia, and Python, enabling the same data to be read and written across languages and platforms.
This talk presents a systematic approach to understanding and adopting Zarr-Python 3, the new major version of Zarr-Python, by walking through the new API, the new storage backend, the improved codec pipeline, deprecations, and more.
I will also show the performance improvements in Zarr-Python 3 when creating, reading, and writing Zarr arrays asynchronously, across both local and remote storage such as AWS S3.
Zarr is a data format for storing chunked, compressed N-dimensional arrays, and the project is fiscally sponsored by NumFOCUS.
It is based on an open technical specification and has implementations in several languages, with Zarr-Python being the most widely used.
After the successful adoption of Specification V3, our team has worked tirelessly over the last year to bring the Python library into compliance with the latest spec.
Outline
First, I'll talk about:
Understanding Zarr basics (5 mins.)
- What is Zarr, and how does it work?
- The inner workings of Zarr using illustrated graphics
- What is the Zarr Specification?
- What's new in Zarr Spec V3?
Then, I'll be talking about the new Zarr-Python 3 and its significant features:
What's new in Zarr-Python 3? (15 mins.)
- Major design updates
- New storage backend
- Creating Zarr arrays and groups asynchronously
- New and improved codec pipeline
- Native GPU support for creating and writing arrays
- Changes and deprecations
- Overview of the new API
- Optimising performance for large arrays
- Deprecation of several stores like LMDBStore, SQLStore, MongoDBStore, etc.
- 3.0 Migration guide
- Steps to migrate from Zarr-Python 2 to Zarr-Python 3
- Extensions
- How can Zarr-Python 3 be extended to add new custom data types, stores, chunking strategies, etc.?
Then, I'll run a hands-on session covering the following:
Hands-on (5 mins.)
- Creating Zarr arrays and groups using Zarr-Python 3
- Plus walkthrough of the new features (mentioned above)
- Writing and reading from Cloud object storage
- Using S3/GCS/Azure to create Zarr arrays and write data to them
- Looking under the hood
- Using the store and info attributes to see how your Zarr data is stored and display key information
Conclusion (5 mins.)
- Key takeaways
- How can you get involved?
- QnA
This talk is aimed at an audience that works with large amounts of data and is looking for a transparent, open-source, reliable, cloud-optimised, and environmentally friendly format.
The tone of the talk will be informative, story-driven, and fun.
Intermediate knowledge of Python and NumPy arrays is recommended for attendees.
After this talk, you'll be able to:
- understand the basics of Zarr and what's new in V3,
- leverage the new functionality of Zarr-Python 3 with improved performance, and
- make an informed decision about which data format to use for your data.
Intermediate
Expected audience expertise: Python: Novice
Sanket is a data scientist based in New Delhi, India. He likes to build data science tools and products and has worked with startups, governments, and other organisations. He loves building community and bringing people together, and is Chair of PyData Delhi and PyData Global.
Currently, he's taking care of the community and OSS at Zarr as their Community Manager.
When he’s not working, he likes to play the violin and computer games and sometimes thinks of saving the world!