State of Parquet 2025: Structure, Optimizations, and Recent Innovations
2025-09-30, Louis Armand 2 - Ouest

If you have worked with large amounts of tabular data, chances are you have dealt with Parquet files. Apache Parquet is an open-source, column-oriented data file format designed for efficient storage and retrieval. It employs high-performance compression and encoding schemes to handle complex data at scale and is supported in many programming languages and analytics tools.
This talk will give a technical overview of the Parquet file format's structure, explain how data is represented and stored in Parquet, and discuss why and how some of the available configuration options might better match your specific use case.

We will also highlight some recent developments and discussions in the Parquet community, including Hugging Face's proposed content-defined chunking, an approach that reduces required storage space by ten percent on realistic training datasets. We will also examine the geometry and geography types added to the Parquet specification in 2025, which enable efficient storage of spatial data and have catalyzed Parquet's growing adoption within the geospatial community.

This talk is designed for data engineers, analysts, and data scientists who regularly work with large tabular datasets and want to optimize their storage and processing pipelines. It will benefit engineers seeking a deeper understanding of Parquet's internal mechanics to make informed decisions about data format configuration, as well as those in machine learning or geospatial fields looking to take advantage of Parquet's features.

Started as a physicist, worked as a data scientist and engineer, got interested in data tooling, and became an Apache Arrow and Parquet contributor, focusing on the C++ and lately Rust implementations. Would like to see numerical computation become more accessible in general-purpose languages and frameworks.

Apache Arrow committer and PMC