PyCon DE & PyData 2026

Open Table Formats in the Wild™ - Reloaded: Vortexing Ducks over Floating Icebergs
Titanium [2nd Floor]

Open table formats have almost freed us from vendor lock-in. They form a critical building block of the modern, composable data stack. The most prominent open table format is Apache Iceberg - not only because of its storage layout, but also due to its REST catalog specification. Iceberg has gained significant traction through a recent stream of feature announcements from the community itself, major cloud providers like AWS, and data platform leaders such as Snowflake and Databricks.

But cutting through the hype: how does Iceberg actually perform in the real world if you are not Netflix or Apple, companies capable of Building Your Own Snowflake (BYOS)? Can you realistically migrate from legacy solutions to Iceberg and enjoy all of its promises without trade-offs?

That, of course, is a rhetorical question. Some even argue that Iceberg got parts of the specification fundamentally wrong!

Curious? Join me for another episode of Open Table Formats in the Wild™. Expect a practical look at the current state of Apache Iceberg and Apache Parquet, alongside a gentle introduction to DuckLake and Vortex as promising contenders for table and file formats, respectively.


Description

The core promise of open table formats is engine interoperability with ACID guarantees, mutability, and schema evolution for massive datasets stored on cheap, reliable cloud object storage. Modern data platforms, however, demand far more than interoperable analytical batch processing. Engineers now require native support for change data capture (CDC), incremental processing, streaming workloads, low-latency access, and point lookups - especially for AI-driven applications. Ideally, all of this would be covered by a single, unified solution.
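To make that promise concrete, here is a minimal sketch using PyIceberg against a REST catalog. The catalog URI, namespace, table name, and columns are placeholder assumptions, not part of the talk:

```python
# Minimal sketch of the open-table-format promise with PyIceberg.
# Catalog URI, namespace, table name, and columns are hypothetical.
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

# Any engine speaking the Iceberg REST catalog spec can discover this table.
catalog = load_catalog("demo", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("analytics.events")

# ACID write: the append commits a new snapshot atomically.
batch = pa.table({"event_id": [1, 2], "payload": ["a", "b"]})
table.append(batch)

# Schema evolution: add a column without rewriting existing data files.
with table.update_schema() as update:
    update.add_column("source", StringType())
```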

However, Parquet - the foundational format for physically storing much of today’s data - predates both the AI boom and the era of unified batch and streaming systems. Likewise, Iceberg’s original design DNA was firmly rooted in large-scale, batch-oriented analytical workloads. This raises an uncomfortable question: are Parquet and Iceberg truly up to the task?

This talk explores that question through real-world use cases and architectural constraints. While the focus is on conveying key ideas and practical insights, the session is aimed at an intermediate to advanced audience. If you are new to the topic, you may want to watch last year’s episode on Apache Parquet and Delta Lake, which provides a gentle introduction to the fundamentals of open table formats.

Takeaways

After this talk, attendees will:
- Understand why incremental processing is not a native concept in Apache Iceberg (see the sketch after this list)
- Recognize how Iceberg’s metadata model creates hard limits for low-latency streaming workloads
- Learn why Parquet’s physical layout becomes a bottleneck for point lookups and AI-driven access patterns
- Get an early look at DuckLake and Vortex as emerging alternatives
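As a taste of the first takeaway: Iceberg records table state as a chain of snapshots, but the format offers no first-class "give me the changes since snapshot X" primitive, so engines (or you) must reconstruct the delta. A rough PyIceberg sketch, with catalog and table names assumed for illustration:

```python
# Sketch of the incremental-processing gap: Iceberg exposes snapshots,
# but turning two snapshots into a change stream is left to the engine.
# Catalog, table, and the presence of >= 2 snapshots are assumptions.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("demo", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("analytics.events")

# The metadata gives you the snapshot lineage...
snapshots = table.snapshots()
previous, latest = snapshots[-2], snapshots[-1]

# ...and you can read the table *as of* either snapshot...
before = table.scan(snapshot_id=previous.snapshot_id).to_arrow()
after = table.scan(snapshot_id=latest.snapshot_id).to_arrow()

# ...but "what changed in between?" is your problem: diffing added data
# files works for append-only tables, while row-level deletes and rewrites
# make reconstructing a correct change stream considerably harder.
```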

Agenda

The Past (10 min)
- Rationale - The Idealized Model
- Implications - The Engineering Trade-offs

The Present (15 min)
- Incremental Processing - The Missing Primitive
- Streaming Workloads - The Batch Inheritance
- AI Applications & Point Lookups - The Access Wall

The Future (15 min)
- DuckLake - The Return of Relational Databases (see the sketch below)
- Vortex - The Parquet of Tomorrow
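As an appetizer for the final act, here is a minimal DuckLake sketch using DuckDB's ducklake extension. Paths and table names are placeholders, and the extension is young, so details may shift:

```python
# Minimal DuckLake sketch: table metadata lives in a relational database
# (here a local DuckDB file) instead of layered JSON/Avro manifest files.
# Paths and names are placeholders.
import duckdb

con = duckdb.connect()
con.install_extension("ducklake")
con.load_extension("ducklake")

# Metadata goes into metadata.ducklake; data files land under data/.
con.sql("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'data/')")

con.sql("CREATE TABLE lake.events AS SELECT 1 AS id, 'hello' AS payload")
con.sql("INSERT INTO lake.events VALUES (2, 'world')")

# Snapshots are ordinary rows in the metadata database...
print(con.sql("SELECT * FROM lake.snapshots()"))

# ...and time travel is a query against an earlier snapshot version.
print(con.sql("SELECT * FROM lake.events AT (VERSION => 1)"))
```

Snapshots and time travel fall out of keeping the metadata in a plain relational database rather than in a tree of manifest files - which is exactly the "return of relational databases" the agenda alludes to.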


Expected audience expertise in your talk's domain: Intermediate
Expected audience expertise in Python: Novice

Hi, my name is Franz, and I'm an open-source and Python enthusiast:

  • father of 3 girls
  • major in psychology
  • chess hobbyist
  • former competitive ultimate frisbee player
  • likes cooking and baking sourdough bread