State of Parquet 2025: Structure, Optimizations, and Recent Innovations
Rok Mihevc, Raúl Cumplido
If you have worked with large amounts of tabular data, chances are you have dealt with Parquet files. Apache Parquet is an open source, column-oriented data file format designed for efficient storage and retrieval. It employs high-performance compression and encoding schemes to handle complex data at scale and is supported in many programming languages and analytics tools.
This talk will give a technical overview of the Parquet file structure, explain how data is represented and stored in Parquet, and show why and how some of the available configuration options might better match your specific use case.
We will also highlight some recent developments and discussions in the Parquet community, including Hugging Face's proposed content-defined chunking, an approach that reduces required storage space by ten percent on realistic training datasets. Finally, we will examine the geometry and geography types added to the Parquet specification in 2025, which enable efficient storage of spatial data and have catalyzed Parquet's growing adoption within the geospatial community.