The Column's the limit: interactive exploration of larger than memory data sets in a notebook with Polars and Buckaroo PyData Boston 2025

2025-12-10 09:45–10:25, Horace Mann

Notebooks struggle when data vastly exceeds RAM: pagination hacks, fragile sampling, and surprise OOMs. Buckaroo is a modern data table for notebooks built to quickly make sense of dataframes by providing search, summary stats, and scrolling with every view. This talk reviews how Buckaroo uses out‑of‑core design patterns (viewport streaming, lazy Polars pipelines, batched background stats, and a series cache) to make interactive exploration fast and reliable on commodity laptops. We’ll walk through the lifecycle of opening a large Parquet/CSV file: detecting formats, avoiding full materialization, fetching only requested row/column ranges, and throttling UI updates for smoothness. We’ll show how column‑level hashing (via a lightweight Rust extension) enables stable cache keys so warm loads render the first viewport and stats in under a second. CSV specifics and a practical CSV→Parquet streaming path round out the approach. The ideas are tool‑agnostic and reproducible with the open‑source PyData stack; Buckaroo serves as a concrete reference implementation. You’ll leave with guidelines and snippets to bring these patterns to your own workflows.

Exploring huge local files in notebooks usually crashes RAM or forces painful sampling. Buckaroo takes a different path: it reads and renders only the visible slice of the infinitely scrollable table, defers all heavy lifting to Polars’ lazy engine, and computes summary stats in the background, updating the UI as results are ready. A Rust-based Polars plugin, running inside the Polars engine, hashes series for column-level caching. The hashes are stable and versioned, so warm loads show the first viewport and stats in under 500ms and cache hits remain correct across sessions.
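The idea of a stable, versioned column hash used as a cache key can be sketched in pure Python. This is a stand-in for illustration only: the talk describes a Rust plugin running inside Polars, whereas the sketch below uses `hashlib` over plain lists, and the names `HASH_VERSION`, `column_cache_key`, and `cached_stats` are hypothetical.

```python
import hashlib

# Hypothetical version tag: bump it whenever the hashing scheme changes,
# so keys written by an older scheme can never be mistaken for current ones.
HASH_VERSION = "v1"


def column_cache_key(dtype: str, values: list) -> str:
    """Derive a content-based key that is identical across Python sessions."""
    h = hashlib.blake2b(digest_size=16)
    h.update(HASH_VERSION.encode("utf-8") + b"\x00")
    h.update(dtype.encode("utf-8") + b"\x00")
    for v in values:
        encoded = repr(v).encode("utf-8")
        # Length prefix keeps ["ab", "c"] distinct from ["a", "bc"].
        h.update(len(encoded).to_bytes(8, "little"))
        h.update(encoded)
    return h.hexdigest()


# In-memory stand-in for a persistent per-column stats cache.
stats_cache: dict[str, dict] = {}


def cached_stats(dtype: str, values: list) -> dict:
    """Compute summary stats once per distinct column content."""
    key = column_cache_key(dtype, values)
    if key not in stats_cache:
        stats_cache[key] = {
            "min": min(values),
            "max": max(values),
            "count": len(values),
        }
    return stats_cache[key]


first = cached_stats("i64", [3, 1, 2])
second = cached_stats("i64", [3, 1, 2])   # same content: served from the cache
changed = cached_stats("i64", [3, 1, 5])  # different content: new key, recomputed
print(first is second, changed["max"])
```

A cryptographic-style digest is used here instead of Python's built-in `hash()` because string hashing in `hash()` is randomized per process by default, which would make keys useless across sessions.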

Typical best practices favor sampling and minimal computation, but in notebooks that means constant retries, tuning, and crashes. Buckaroo turns scale into a safe default: a cache‑aware, lazy, column‑wise DAG streams only what’s needed for interactive analysis. New summary stats plug in at runtime and inherit the same out‑of‑core guarantees. Our reliability bar is simple: if the biggest column fits in half your RAM, a crash is a bug, so you can open any dataframe without sizing rituals.

This talk distills the out-of-core patterns behind Buckaroo (viewport streaming, lazy Polars pipelines, background workers, Rust-assisted hashing, and correctness-preserving cache invalidation) so attendees can build similar experiences or get more from Polars when data dwarfs RAM.

Prior Knowledge Expected: No previous knowledge expected

Paddy Mullen

Paddy Mullen is a full‑stack engineer and data‑tooling builder. An early employee at Anaconda, he contributed to the Bokeh visualization library. He has built data tools and led teams at hedge funds and startups. Since 2023 he has been developing Buckaroo, an interactive dataframe viewer for notebook environments.
