Apache Arrow: a cross-language development platform for in-memory data (EuroSciPy 2019)

2019-09-04 14:45–15:15, Track 1 (Mitxelena)

Apache Arrow defines a columnar, in-memory data format standard and communication protocols. It provides a cross-language development platform that already has several applications in the PyData ecosystem.

This talk discusses the Apache Arrow project and how it already interacts with the Python ecosystem.

The Apache Arrow project specifies a standardized, language-independent columnar memory format for flat and nested data, organized for efficient analytic operations on modern hardware. On top of that standard, it provides computational libraries as well as zero-copy streaming messaging and interprocess communication protocols. Together, these make it a cross-language development platform for in-memory data. It supports many languages, including C, C++, Java, JavaScript, MATLAB, Python, R, Rust, and others.
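To make the idea of a columnar layout more concrete, here is a minimal sketch in plain Python. It does not use Arrow or its API; it only imitates the general shape of an Arrow-style column (a contiguous values buffer plus a validity bitmap with one bit per row) using the standard library. The helper names `to_column` and `is_valid` and the sample records are assumptions made up for this illustration.

```python
from array import array

# Row-oriented data: one Python object per record (made-up sample values).
rows = [
    {"id": 1, "score": 0.5},
    {"id": 2, "score": None},  # missing value
    {"id": 3, "score": 2.5},
]


def to_column(records, name, typecode):
    """Pack one field into a contiguous values buffer and a validity bitmap."""
    values = array(typecode)
    # One bit per row, least-significant bit first within each byte.
    validity = bytearray((len(records) + 7) // 8)
    for i, record in enumerate(records):
        value = record[name]
        if value is None:
            values.append(0)  # placeholder slot; the bitmap marks it as null
        else:
            values.append(value)
            validity[i // 8] |= 1 << (i % 8)
    return values, validity


def is_valid(validity, i):
    """Return True if row i holds a real (non-null) value."""
    return bool((validity[i // 8] >> (i % 8)) & 1)


ids, ids_valid = to_column(rows, "id", "q")           # 64-bit integers
scores, scores_valid = to_column(rows, "score", "d")  # 64-bit floats

# An analytic operation scans a single contiguous buffer and skips nulls.
total = sum(v for i, v in enumerate(scores) if is_valid(scores_valid, i))

# A memoryview slice shares memory with the buffer instead of copying it.
view = memoryview(scores)[1:]
scores[2] = 9.0  # the change is visible through the view
```

Because each column lives in one typed, contiguous buffer, a consumer that knows the layout can read it without converting or copying the values. That is the property the zero-copy protocols mentioned above rely on.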

The Apache Arrow project, although still in active development, already has several applications in the Python ecosystem. For example, it provides the IO functionality that pandas uses to read the Parquet format (a columnar, binary file format widely used in the Hadoop ecosystem). Thanks to its standard memory format, it can help improve interoperability between systems. This is already seen in practice in the Spark / Python interface, where it increases the performance of PySpark. Further, it has the potential to provide a more performant string data type and nested data types (like dicts or lists) for pandas DataFrames, which is already being experimented with in the fletcher package (using the pandas ExtensionArray interface).

Project Homepage / Git:

https://arrow.apache.org/

https://github.com/apache/arrow

Abstract as a tweet:

Apache Arrow: cross-language development platform for tabular, in-memory data and how it relates to the PyData ecosystem

Python Skill Level:

basic

Domain Expertise:

none

Domains:

Big Data, Open Source, Vector and array manipulation

Joris Van den Bossche

I am a core contributor to Pandas and maintainer of GeoPandas. I have given several tutorials at international conferences and a course on Python for data analysis for PhD students at Ghent University. I did a PhD at Ghent University and VITO in air quality research and worked at the Paris-Saclay Center for Data Science. I am currently a freelance software developer and teacher.
