The expanding Apache Arrow universe - standardizing and accelerating tabular data access and interchange
2024-09-25, Gaston Berger

Apache Arrow has become a de facto standard for efficient in-memory columnar data representation. Beyond the standardized, language-independent columnar memory format for tabular data, the Apache Arrow project also has a growing set of supplementary specifications and language implementations. This talk will give an overview of recent developments in the Apache Arrow ecosystem, including ADBC, nanoarrow, new data types, and the Arrow PyCapsule protocol.


According to its website, Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing. At its core is the Arrow Columnar Format, a language-independent columnar memory format for tabular data. The project encompasses much more than that, however, including serialization, messaging, and database specifications, as well as a variety of language implementations. This ecosystem of specifications and tools (and of libraries adopting them) is constantly growing.

After a brief recap of the Apache Arrow project to set the stage, this talk will give an overview of recent developments in the Apache Arrow ecosystem: from additions to the core Columnar Format (such as the string view type), to newer specifications (such as the ADBC database specification, additions to Flight SQL, and the Arrow PyCapsule protocol), to implementations (such as nanoarrow).

I am a core contributor to pandas and Apache Arrow, and a maintainer of GeoPandas. I did a PhD in air quality research at Ghent University and VITO, and worked at the Paris-Saclay Center for Data Science. Currently, I work at Voltron Data, contributing to Apache Arrow, and am a freelance teacher of Python (pandas) at Ghent University.