PyData Boston 2025

To see our schedule with full functionality, like timezone conversion and personal scheduling, please enable JavaScript and go here.

Monday, Dec. 8, 2025

Tuesday, Dec. 9, 2025

Wednesday, Dec. 10, 2025

09:00

From Notebook to Pipeline: Hands-On Data Engineering with Python

Gilberto Hernandez

In this hands-on tutorial, you'll go from a blank notebook to a fully orchestrated data pipeline built entirely in Python, all in under 90 minutes. You'll learn how to design and deploy end-to-end data pipelines using familiar notebook environments, using Python for your data loading, data transformations, and insights delivery.

We'll dive into the Ingestion-Tranformation-Delivery (ITD) framework for building data pipelines: ingest raw data from cloud object storage, transform the data using Python DataFrames, and deliver insights via a Streamlit application.

Basic familiarity with Python (and/or SQL) is helpful, but not required. By the end of the session, you'll understand practical data engineering patterns and leave with reusable code templates to help you build, orchestrate, and deploy data pipelines from notebook environments.

10:30

10:30

30min

Break

Abigail Adams

10:30

30min

Break

Thomas Paul

12:30

12:30

60min

Lunch

Horace Mann

12:30

60min

Lunch

Abigail Adams

12:30

60min

Lunch

Thomas Paul

13:30

Building LLM Agents Made Simple

Learn to build practical LLM agents using LlamaBot and Marimo notebooks. This hands-on tutorial teaches the most important lesson in agent development: start with workflows, not technology.

We'll build a complete back-office automation system through three agents: a receipt processor that extracts data from PDFs, an invoice writer that generates documents, and a coordinator that orchestrates both. This demonstrates the fundamental pattern for agent systems—map your boring workflows first, build focused agents for specific tasks, then compose them so agents can use other agents as tools.

By the end, you'll understand how to identify workflows worth automating, build agents with decision-making loops, compose agents into larger systems, and integrate them into your own work. You'll leave with working code and confidence to automate repetitive tasks.

Prerequisites: Intermediate Python, familiarity with APIs, basic LLM understanding. Participants should have Ollama and models installed beforehand (setup instructions provided).

Materials: GitHub repository with Marimo notebooks. Setup uses Pixi for dependency management.

Learn to Unlock Document Intelligence with Open-Source AI

Unlocking the full potential of AI starts with your data, but real-world documents come in countless formats and levels of complexity. This session will give you hands-on experience with Docling, an open-source Python library designed to convert complex documents into AI-ready formats. Learn how Docling simplifies document processing, enabling you to efficiently harness all your data for downstream AI and analytics applications.

15:00

15:00

30min

Break

Horace Mann

15:00

30min

Break

Abigail Adams

15:00

30min

Break

Thomas Paul

15:30

"Save your API Keys for someone else" -- Using the HuggingFace and Ollama ecosystems to run good-enough LLMs on your laptop

Ian Stokes-Rees

In this 90 minute tutorial we'll get anyone with some basic Python and Command Line skills up and running with their own 100% laptop based set of LLMs, and explain some successful patterns for leveraging LLMs in a data analysis environment. We'll also highlight pit-falls waiting to catch you out, and encourage you that your pre-GenAI analytics skills are still relevant today and likely will be for the foreseeable future by demonstrating the limits of LLMs for data analysis tasks.

Generative Programming with Mellea: from Agentic Soup to Robust Software

Nathan Fulton, Jake Lorocco

Agentic frameworks make it easy to build and deploy compelling demos. But building robust systems that use LLMs is difficult because of inherent environmental non-determinism. Each user is different, each request is different; the very flexibility that makes LLMs feel magical in-the-small also makes agents difficult to wrangle in-the-large.

Developers who have built large agentic-like systems know the pain. Exceptional cases multiply, prompt libraries grow, instructions are co-mingled with user input. After a few iterations, an elegant agent evolves into a big ball of mud.

This hands-on tutorial introduces participants to Mellea, an open-source Python library for writing structured generative programs. Mellea puts the developer back in control by providing the building blocks needed to circumscribe, control, and mediate essential non-determinism.

Going multi-modal: How to leverage the lastest multi-modal LLMs and deep learning models on real world applications

Multimodal deep learning models continue improving rapidly, but creating real-world applications that effectively leverage multiple data types remains challenging. This hands-on tutorial covers model selection, embedding storage, fine-tuning, and production deployment through two practical examples: a historical manuscript search system and flood forecasting with satellite imagery and time series data.

08:00

08:00

60min

Registration & Breakfast

Horace Mann

09:00

09:00

15min

Opening Notes

Horace Mann

09:15

09:15

45min

Keynote: Isabel Zimmerman

Horace Mann

10:40

10:40

35min

Break

Horace Mann

12:00

Where Have All the Metrics Gone?

Dr. Rebecca Bilbro

How exactly does one validate the factuality of answers from a Retrieval-Augmented Generation (RAG) system? Or measure the impact of the new system prompt for your customer service agent? What do you do when stakeholders keep asking for "accuracy" metrics that you simply don't have? In this talk, we’ll learn how to define (and measure) what “good” looks like when traditional model metrics don’t apply.

12:40

12:40

65min

Lunch

Horace Mann

13:45

13:45

45min

Keynote: Lisa Amini

Horace Mann

14:30

The SAT math gap: gender difference or selection bias?

Why do male test takers consistently score about 30 points higher than female test takers on the mathematics section of the SAT? Does this reflect an actual difference in math ability, or is it an artifact of selection bias—if young men with low math ability are less likely to take the test than young women with the same ability?

This talk presents a Bayesian model that estimates how much of the observed difference can be explained by selection effects. We’ll walk through a complete Bayesian workflow, including prior elicitation with PreliZ, model building in PyMC, and validation with ArviZ, showing how Bayesian methods disentangle latent traits from observed outcomes and separate the signal from the noise.

No prior knowledge of Bayesian statistics is required; attendees should be familiar with Python and common probability distributions.

15:10

15:10

35min

New break

Horace Mann

16:30

16:30

60min

Lightning Talks

Horace Mann

08:00

08:00

60min

Breakfast & Registration

Horace Mann

09:00

Wrappers and Extenders: Companion Packages for Python Projects

Jules Walzer-Goldfeld

Many Python users want features that don’t fit within the boundaries of their favorite libraries. Instead of forking or waiting on a pull request, you can build your own wrapper or extender package. This talk introduces the principles of designing companion packages that enhance existing libraries without changing their core code, using gt-extras as a case study. You’ll learn how to structure, document, and distribute your own add-ons to extend the tools you rely on.

09:45

Rethinking Feature Importance: Evaluating SHAP and TreeSHAP for Tree-Based Machine Learning Models

Tree-based machine learning models such as XGBoost, LightGBM, and CatBoost are widely used, but understanding their predictions remains challenging. SHAP (SHapley Additive exPlanations) provides feature attributions based on Shapley values, yet its assumptions — feature independence, additivity, and consistency — are often violated in practice, potentially producing misleading explanations.
This talk critically examines SHAP’s limitations in tree-based models and introduces TreeSHAP, its specialized implementation for decision trees. Rather than presenting it as perfect, we evaluate its effectiveness, highlighting where it succeeds and where explanations remain limited. Attendees will gain a practical, critical understanding of SHAP and TreeSHAP, and strategies for interpreting tree-based models responsibly.

Target audience: Data scientists, ML engineers, and analysts familiar with tree-based models.
Background: Basic understanding of feature importance and model interpretability.

The Column's the limit: interactive exploration of larger than memory data sets in a notebook with Polars and Buckaroo

Notebooks struggle when data vastly exceeds RAM: pagination hacks, fragile sampling, and surprise OOMs. Buckaroo is a modern data table for notebooks built to quickly make sense of dataframes by providing search, summary stats, and scrolling with every view. This talk reviews how Buckaroo uses out‑of‑core design patterns, viewport streaming, lazy Polars pipelines, batched background stats, and a series cache to make interactive exploration fast and reliable on commodity laptops. We’ll walk through the lifecycle of opening a large Parquet/CSV file: detecting formats, avoiding full materialization, fetching only requested row/column ranges, and throttling UI updates for smoothness. We’ll show how column‑level hashing (via a lightweight Rust extension) enables stable, cache keys so warm loads render the first viewport and stats in under a second. CSV specifics and a practical CSV→Parquet streaming path round out the approach. The ideas are tool‑agnostic and reproducible with the open‑source PyData stack; Buckaroo serves as a concrete reference implementation. You’ll leave with guidelines and snippets to bring these patterns to your own workflows.

Deborah Sampson

10:30

10:30

30min

Break

Horace Mann

11:00

Accelerating Geospatial Analysis with GPUs

Jaya Venkatesh, Jacob Tomlinson, Naty Clementi

Geospatial analysis often relies on raster data, n‑dimensional arrays where each cell holds a spatial measurement. Many raster operations, such as computing indices, statistical analysis, and classification, are naturally parallelizable and ideal for GPU acceleration.

This talk demonstrates an end‑to‑end GPU‑accelerated semantic segmentation pipeline for classifying satellite imagery into multiple land cover types. Starting with cloud-hosted imagery, we will process data in chunks, compute features, train a machine learning model, and run large-scale predictions. This process is accelerated with the open-source RAPIDS ecosystem, including Xarray, cuML, and Dask, often requiring only minor changes to familiar data science workflows.

Attendees who work with raster data or other parallelizable, computationally intensive workflows will benefit most from this talk, which focuses on GPU acceleration techniques. While the talk draws from geospatial analysis, key geospatial concepts will be introduced for beginners. The methods demonstrated can be applied broadly across domains to accelerate large-scale data processing.

fastplotlib: driving scientific discovery through data visualization

Caitlin Lewis, Kushal Kolar

Fast interactive visualization remains a considerable barrier in analysis pipelines for large neuronal datasets. Here, we present fastplotlib, a scientific plotting library featuring an expressive API for very fast visualization of scientific data. Fastplotlib is built upon pygfx, which utilizes the GPU via WGPU, allowing it to interface with modern graphics APIs such as Vulkan for fast rendering of objects. Fastplotlib is non-blocking, allowing for interactivity with data after plot generation. Ultimately, fastplotlib is a general-purpose scientific plotting library that is useful for fast and live visualization and analysis of complex datasets.

11:45

Embracing Noise: How Data Corruption Can Make Models Smarter

Machine learning often assumes clean, high-quality data. Yet the real world is noisy, incomplete, and messy, and models trained only on sanitized datasets become brittle. This talk explores the counterintuitive idea that deliberately corrupting data during training can make models more robust. By adding structured noise, masking inputs, or flipping labels, we can prevent overfitting, improve generalization, and build systems that survive real world conditions. Attendees will leave with a clear understanding of why “bad data” can sometimes lead to better models.

Deborah Sampson

12:30

12:30

60min

Lunch

Horace Mann

13:30

Is Your LLM Evaluation Missing the Point?

Your LLM evaluation suite shows 93% accuracy. Then domain experts point out it's producing catastrophically wrong answers for real-world use cases. This talk explores the collaboration gap between AI engineers and domain experts that technical evaluation alone cannot bridge. Drawing from government, healthcare, and civic tech case studies, we'll examine why tools like PromptFoo, DeepEval, and RAGAS are necessary but insufficient and how structured collaboration with domain stakeholders reveals critical failures invisible to standard metrics. You'll leave with practical starting points for building cross-functional evaluation that catches problems before deployment.

Tracking Policy Evolution Through Clustering: A New Approach to Temporal Pattern Analysis in Multi-Dimensional Data

Sarthak Pattnaik

Analyzing how patterns evolve over time in multi-dimensional datasets is challenging—traditional time-series methods often struggle with interpretability when comparing multiple entities across different scales. This talk introduces a clustering-based framework that transforms continuous data into categorical trajectories, enabling intuitive visualization and comparison of temporal patterns.What & Why: The method combines quartile-based categorization with modified Hamming distance to create interpretable "trajectory fingerprints" for entities over time. This approach is particularly valuable for policy analysis, economic comparisons, and any domain requiring longitudinal pattern recognition.Who: Data scientists and analysts working with temporal datasets, policy researchers, and anyone interested in comparative analysis across entities with different scales or distributions.Type: Technical presentation with practical implementation examples using Python (pandas, scikit-learn, matplotlib). Moderate mathematical content balanced with intuitive visualizations.Takeaway: Attendees will learn a novel approach to temporal pattern analysis that bridges the gap between complex statistical methods and accessible, policy-relevant insights. You'll see practical implementations analyzing 60+ years of fiscal policy data across 8 countries, with code available for adaptation to your own datasets.

14:15

Evaluating AI Agents in production with Python

Susan Shu Chang

This talk covers methods of evaluating AI Agents, with an example of how the speakers built a Python-based evaluation framework for a user-facing AI Agent system which has been in production for over a year. We share tools and Python frameworks used (as well as tradeoffs and alternatives), and discuss methods such as LLM-as-Judge, rules-based evaluations, ML metrics used, as well as selection tradeoffs.

Processing large JSON files without running out of memory

Itamar Turner-Trauring

If you need to process a large JSON file in Python, it’s very easy to run out of memory while loading the data, leading to a super-slow run time or out-of-memory crashes. In this talk you'll learn:

How to measure memory usage.
Why loading JSON takes a lot of memory.
Four different ways to reduce memory usage when loading large JSON files.

Unlocking Smarter Typeahead Search: A Hybrid Framework for Large-Scale Query Suggestions

Brandon (Anbang) Wu

We present a hybrid framework for typeahead search that combines prefix matching with semantic retrieval using open-source tools. Applied at Quizlet, it indexed 200 million terms and improved coverage, boosted relevance, and lifted suggestion engagement by up to 37 percent—offering a reusable approach for building scalable, robust query suggestions.

15:00

15:00

30min

Break

Horace Mann

15:30

MMM Open- Source Showdown: A Practitioner's Benchmark of PyMC-Marketing vs. Google Meridian

Your Marketing Mix Model is only as good as the library you build it on. But how do you choose between PyMC-Marketing and Google Meridian when the feature lists look so similar? You need hard evidence, not marketing claims. Which library is actually faster on multi-geo data? Do their different statistical approaches (splines vs. Fourier series) lead to different budget decisions?

This talk delivers that evidence. We present a rigorous, open-source benchmark that stress-tests both libraries on the metrics that matter in production. Using a synthetic dataset that replicates real-world ad spend patterns, we measure:

Speed: Effective sample size per second (ESS/s) across different data scales.
Accuracy: How well each model recovers both sales figures and true channel contributions.
Reliability: A deep dive into convergence diagnostics and residual analysis.
Resources: The real memory cost of fitting these models.

You'll walk away from this session with a clear, data-driven verdict, ready to choose the right tool and defend that choice to your team.

Surviving the Agentic Hype with Small Language Models

Serhii Sokolenko

The AI landscape is abuzz with talk of "agentic intelligence" and "autonomous reasoning." But beneath the hype, a quieter revolution is underway: Small Language Models (SLMs) are starting to perform the core reasoning and orchestration tasks once thought to require massive LLMs. In this talk, we’ll demystify the current state of “AI agents,” show how compact models like Phi-2, xLAM 8B, and Nemotron-H 9B can plan, reason, and call tools effectively, and demonstrate how you can deploy them on consumer-grade hardware. Using Python and lightweight frameworks such as LangChain, we’ll show how anyone can quickly build and experiment with their own local agentic systems. Attendees will leave with a grounded understanding of agent architectures, SLM capabilities, and a roadmap for running useful agents without the GPU farm.