Apache Arrow: a cross-language development platform for in-memory data
2019-09-04, 14:45–15:15, Track 1 (Mitxelena)

Apache Arrow, defining a columnar, in-memory data format standard and communication protocols, provides a cross-language development platform with already several applications in the PyData ecosystem.


This talk discusses Apache Arrow project and how it already interacts with the Python ecosystem.

The Apache Arrow project specifies a standardized language-independent columnar memory format for flat and nested data, organized for efficient analytic operations on modern hardware. On top of that standard, it provides computational libraries and zero-copy streaming messaging and interprocess communication protocols, and as such, it provides a cross-language development platform for in-memory data. It has support for many languages, including C, C++, Java, JavaScript, MATLAB, Python, R, Rust, ..

The Apache Arrow project, although still in active development, has already several applications in the Python ecosystem. For example, it provides the IO functionality for pandas to read the Parquet format (a columnar, binary file format used a lot in the Hadoop ecosystem). Thanks to the standard memory format, it can help improve interoperability between systems, and this is already seen in practice for the Spark / Python interface, by increasing the performance of PySpark. Further, it has the potential to provide a more performant string data type and nested data types (like dicts or lists) for Pandas dataframes, which is already being experimented with in the fletcher package (using the pandas ExtensionArray interface).


Domains – Big Data, Open Source, Vector and array manipulation Project Homepage / Git – https://github.com/apache/arrow Domain Expertise – none Python Skill Level – basic Project Homepage / Git – https://arrow.apache.org/ Abstract as a tweet – Apache Arrow: cross-language development platform for tabular, in-memory data and how it relates to the PyData ecosystem