2026-06-06, Grand Hall 2
The Python data ecosystem is migrating from NumPy-based arrays toward Apache Arrow. Polars is built entirely on Arrow, and Pandas is heading in the same direction. Yet differences in string encoding, missing values, schemas, and index metadata make interoperability between the two formats surprisingly costly and error-prone. This talk examines these challenges through a case study of how ArcticDB, the open-source client-side dataframe database, navigated this same migration.
As organisations adopt Polars alongside Pandas, a critical question emerges: how do you move data between the two without silent data loss, performance regressions, or broken round-trips? The answer is more complex than calling polars.from_pandas.
Pandas stores data in NumPy arrays by default, though as of 3.0 it uses Arrow for strings. Polars is built entirely on Apache Arrow's columnar format. For each area where these formats diverge, this talk will explain the problem and show how ArcticDB, a dataframe database that must serialize, store, and reconstruct both formats, solves it in practice:
- Memory layout: How NumPy and Arrow represent the same logical data differently, and how a dataframe database can bridge the two
- Strings: NumPy object arrays vs. Arrow's offset-based binary buffers -- why Arrow is dramatically more efficient and the cost of conversion
- Missing values: NaN/NaT/None sentinels vs. Arrow's validity bitmask -- why a Pandas NaN behaves differently from a Polars null and what breaks during conversion
- Schema differences: Different supported data types and different allowed column names -- e.g. Pandas allows mixed-type columns that Arrow cannot represent
- Pandas-specific metadata that has no Arrow equivalent: Index and RangeIndex semantics, and MultiIndex, which uses an entirely different memory layout with its own performance implications
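To make the string and missing-value points above concrete, here is a minimal, standard-library-only sketch of an Arrow-style string column: one contiguous UTF-8 data buffer, an offsets list, and a validity bitmap. The helpers `encode_strings` and `decode_strings` are hypothetical names invented for this illustration; they are not part of pyarrow, Polars, Pandas, or ArcticDB, and the layout is simplified (real Arrow buffers are padded and aligned). The last lines illustrate the sentinel side of the problem, assuming a NumPy-backed integer column has been upcast to float64 to hold NaN.

```python
from typing import List, Optional, Tuple


def encode_strings(values: List[Optional[str]]) -> Tuple[bytes, List[int], bytes]:
    """Pack optional strings into (validity bitmap, offsets, data buffer).

    Toy model of an Arrow-style layout, not a real Arrow implementation.
    """
    validity = bytearray((len(values) + 7) // 8)  # one bit per value
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            validity[i // 8] |= 1 << (i % 8)  # least-significant bit first
            data += value.encode("utf-8")
        # A null slot adds no bytes, so it repeats the previous offset.
        offsets.append(len(data))
    return bytes(validity), offsets, bytes(data)


def decode_strings(
    validity: bytes, offsets: List[int], data: bytes
) -> List[Optional[str]]:
    """Rebuild the Python values from the three buffers."""
    out: List[Optional[str]] = []
    for i in range(len(offsets) - 1):
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if is_valid:
            out.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
        else:
            out.append(None)
    return out


# An empty string and a missing value have the same offsets,
# so only the validity bitmap tells them apart.
column = ["arrow", None, "", "pandas"]
validity, offsets, data = encode_strings(column)
roundtrip = decode_strings(validity, offsets, data)

# Sentinel side: a float NaN sentinel is not equal to itself, and an
# integer stored as float64 to make room for NaN can lose precision.
nan = float("nan")
nan_equals_itself = nan == nan            # False
big = 2**53 + 1
lost_precision = int(float(big)) != big   # True

print(validity, offsets, data, roundtrip, nan_equals_itself, lost_precision)
```

In this sketch the whole column lives in three flat buffers rather than one Python object per value, and the null is recorded out of band instead of being encoded as a special value inside the data.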
Together, these issues make conversion between Pandas and Polars far from trivial. This is especially challenging for a dataframe database like ArcticDB, where petabytes of Pandas DataFrames are stored and users increasingly want to read them back as Arrow. The talk will include benchmarks comparing native format reads against conversion-based approaches, and practical takeaways for anyone migrating a codebase, building a library that supports both formats, or choosing a dataframe database.
Ivo Dilov has 5 years of industry experience and 10 years of competitive programming experience, with a focus on high-performance software. For the past two and a half years, he has been a senior engineer on ArcticDB, the open-source DataFrame database backed by Man Group and Bloomberg, working in C++ and Python.