2023-07-26, 32-123
The Parquet tabular data storage format has become ubiquitous, particularly in "big data" contexts, where it is arguably the only binary format to have successfully supplanted CSV. Despite this, there are relatively few implementations of Parquet, which has historically presented challenges for Julia. I will give a brief overview of Parquet2.jl, a pure Julia Parquet implementation, including a comparison to other tools and formats and a look at what is still needed to reach parity with pyarrow.
We will touch on the following:
- Why did I write Parquet2.jl when Parquet.jl already existed?
- Extremely quick overview of features.
- Answering the often asked question: which format should I use?
- A very brief mention of some idiosyncrasies of the format, some challenges of testing against the JVM implementation and why edge cases pop up.
- What features are missing? How far is this from parity with the pyarrow implementation?
My educational background is in both experimental and theoretical high energy physics. My initial programming experience mostly centered on scientific/numerical computing in C++ and Fortran. Since receiving my PhD I have been working as a data scientist, and I have been primarily using Julia both in and outside my job for almost 7 years. My recent programming experiences have involved both convex and non-convex optimization, including large-scale mixed integer conic programming, as well as machine learning and statistics. My interest in Julia and other new languages such as Zig, as well as my enthusiasm for the broader Linux ecosystem, has also led me to spend a lot of time with serialization, IPC and network protocols. I also enjoy video games, which has led me to follow projects around gaming on Linux, and as a guitar player I'm also interested in audio.