PyConDE & PyData Berlin 2024

A deep dive into the Arrow Columnar format with pyarrow and nanoarrow
2024-04-24 , A03-A04

Apache Arrow has become a de-facto standard for efficient in-memory columnar data representation. You might have heard about Arrow or already be using it, but do you understand the format and why it’s so useful? This tutorial will dive deep into the details of the Arrow columnar format, the different types and buffer layouts, and explore those details interactively using the pyarrow and nanoarrow libraries.


You can find the material and setup instructions at https://github.com/voltrondata-labs/2024-arrow-format-tutorial/

According to the project website, Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing. Nowadays, the Arrow project encompasses many things, including serialization, messaging and database specifications, and a variety of language implementations. But at its core is the Columnar Format: a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs.
This format is being used (fully or partially) by many libraries that you might know, such as pandas, polars, datafusion, duckdb, cudf, influxdb, and many more.

This tutorial will dive into the details of the Columnar format, exploring the physical memory layout and the different data types. It does so with interactive code examples using the pyarrow and nanoarrow libraries, showing how you can create and inspect Arrow data with them. Along the way you will also learn a bit about those two libraries, but the insights about the columnar format itself are general and apply to any project using such data under the hood.
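
As a rough illustration of what "physical memory layout" means here, the sketch below builds the buffers of two small arrays by hand, using only the Python standard library. It is not taken from the tutorial material, and the helper names `build_int32_buffers` and `build_utf8_buffers` are hypothetical. It assumes the layout described in the Arrow columnar specification: a validity bitmap with least-significant-bit numbering plus a contiguous little-endian data buffer for fixed-width types, and an int32 offsets buffer plus a byte data buffer for variable-length strings. Alignment and padding of buffers are left out for brevity.

```python
import struct


def build_int32_buffers(values):
    """Pack a list of ints (or None for null) into (validity bitmap, data) bytes.

    Hypothetical helper for illustration only; real libraries also pad buffers.
    """
    bitmap = bytearray((len(values) + 7) // 8)  # one bit per slot
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            # Bit i is set when slot i is valid (least-significant-bit numbering).
            bitmap[i // 8] |= 1 << (i % 8)
        # Null slots still occupy space; the stored value is arbitrary (0 here).
        data += struct.pack("<i", 0 if value is None else value)
    return bytes(bitmap), bytes(data)


def build_utf8_buffers(strings):
    """Pack non-null strings into (int32 offsets with n + 1 entries, data) bytes."""
    offsets = [0]
    data = bytearray()
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))  # end of this string = start of the next
    return struct.pack(f"<{len(offsets)}i", *offsets), bytes(data)


validity, data = build_int32_buffers([1, None, 3])
print(format(validity[0], "08b"))   # 00000101: slots 0 and 2 are valid
print(struct.unpack("<3i", data))   # (1, 0, 3)

offsets, chars = build_utf8_buffers(["ab", "", "cde"])
print(struct.unpack("<4i", offsets))  # (0, 2, 2, 5)
print(chars)                          # b'abcde'
```

With pyarrow installed, the corresponding buffers of a real array can be inspected with `Array.buffers()`, which is the kind of exploration the tutorial does interactively.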


Expected audience expertise: Domain:

Intermediate

Expected audience expertise: Python:

Intermediate

Abstract as a tweet (X) or toot (Mastodon):

Apache Arrow has become a de-facto standard for efficient in-memory columnar data representation. But what is this format exactly? This tutorial will dive deep into the details of the Arrow columnar format and explore interactively the different types and buffer layouts.

I am a core contributor to pandas and Apache Arrow, and a maintainer of GeoPandas. I did a PhD at Ghent University and VITO in air quality research and worked at the Paris-Saclay Center for Data Science. Currently, I work at Voltron Data, contributing to Apache Arrow, and am a freelance teacher of Python (pandas) at Ghent University.

I started working with Python in 2008 with Python 2.5, and it has since become my language of choice. I have been involved in the Spanish Python community as one of the co-founders of the Python Spanish Association. I have been involved in the organisation of EuroPython in Bilbao, several PyCon ES (Spain) and the Barcelona meetup.
A couple of years ago I started working on Apache Arrow, and since then I have become a committer and a PMC member. I want to share with the rest of the world what we have done and what we are doing.

My software development journey started with open source and the Apache Arrow project. More specifically, I started by contributing to the Arrow R package in 2021. After that I contributed to other open source projects connected to the Python dataframe API standard while at Quansight, and became an Apache Arrow committer in 2022 after being a regular contributor to Apache Arrow (Python) since 2021. I am currently working at Voltron Data as a Software Engineer.