2019-09-04, 14:45–15:15, Track 1 (Mitxelena)
Apache Arrow defines a standard columnar, in-memory data format together with communication protocols. It provides a cross-language development platform that already has several applications in the PyData ecosystem.
This talk discusses the Apache Arrow project and how it already interacts with the Python ecosystem.
The Apache Arrow project, although still in active development, already has several applications in the Python ecosystem. For example, it provides the IO functionality that pandas uses to read the Parquet format (a columnar, binary file format widely used in the Hadoop ecosystem). Thanks to its standard memory format, it can help improve interoperability between systems; this is already seen in practice in the Spark / Python interface, where it increases the performance of PySpark. Further, it has the potential to provide a more performant string data type and nested data types (like dicts or lists) for pandas DataFrames, which is already being experimented with in the fletcher package (using the pandas ExtensionArray interface).