PyConDE & PyData Berlin 2024

Going beyond Parquet's default settings – be surprised what you can get
04-22, 14:35–15:05 (Europe/Berlin), B09

Apache Parquet has become the de facto format for storing tabular (DataFrame) data on disk. It achieves this through general-purpose compression and by exploiting knowledge of the structure of the stored data. In this talk, we show the core structure of Parquet and the knobs that let you get even more out of the file format.


In the last decade, Apache Parquet has become the standard format for storing tabular data on disk, regardless of the technology stack used. This is due to its read/write performance, efficient compression, interoperability, and especially its strong performance with the default settings.

While these default settings and access patterns already provide decent performance, by understanding the format in more detail and using recent developments, one can get much better performance, smaller files, and utilise Parquet's newer partial reading features to read even smaller subsets of a file for a given query.

This talk aims to provide insight into the Parquet format and its recent developments that are useful for end users' daily workflows. The only prior knowledge required is knowing what a DataFrame (tabular data) is.


Expected audience expertise: Domain

Novice

Expected audience expertise: Python

Novice

Abstract as a tweet (X) or toot (Mastodon)

Only ever used pandas' DataFrame.to_parquet? Would you like to know what it does and how you could make it even more efficient? Find out about Parquet's newest features in this talk.

See also: Slides

Uwe Korn is a CTO at the data science company QuantCo. His expertise is in building scalable architectures for machine learning services and the teams & culture around them. Nowadays, he focuses on the data engineering infrastructure that is needed to provide the building blocks to bring machine learning models into production. As part of his work to provide an efficient data interchange, he became a core committer to the Apache Parquet, Apache Arrow and conda-forge projects.