08:00
60min
Registration & Welcome Coffee
Gaston Berger
09:00
15min
Opening session
Gaston Berger
09:15
45min
You Don’t Have to Be an Expert: Stories from the Open Source Frontlines
Alenka Frim

Four years ago, I had no idea what PyArrow was—or how open source development worked. But through mentorship, collaboration, and learning in public, I found not just a place in the community, but a sense of how open source evolves and connects.

In this keynote, I’ll share my experience on how complex projects like Apache Arrow evolve through shared protocols, cross-project conversations, and the people behind them. Along the way, we’ll look at the human side of technical work, the quiet strength of standards, and how imposter syndrome, while uncomfortable, has sharpened my curiosity and helped me find my own way of contributing.

Gaston Berger
10:00
5min
Room change
Gaston Berger
10:05
30min
From Jupyter Notebook to Publish-Ready Report: Effortless Sharing with Quarto
Christophe Dervieux

See how Quarto can transform your Jupyter notebooks into stakeholder-ready web pages or PDFs, published online with just one command. This session features practical demonstrations of publishing with quarto publish, applying custom styles tailored to your organization thanks to brand.yml, and leveraging new features for reproducible research.

Designed for anyone looking to share their work, this talk requires only basic Python and notebook familiarity. You’ll walk away with the skills to elevate your reporting workflow and share insights professionally.

Louis Armand 1 - Est
10:05
30min
Open-source Business
Sylvain Corlay, Yann Lechelle

Challenges in economics and governance models for open-source scientific projects

In this presentation, the CEOs of two companies at the forefront of open-source scientific software development - Sylvain Corlay of QuantStack and Yann Lechelle of Probabl - examine the intricate challenges of open-source funding and governance and reflect on how these two aspects interconnect.

We start by reflecting on the origins of the open-source movement within the scientific community, and delve into the contemporary challenges of operating businesses and identifying sustainable economic models that both leverage and contribute to open-source software.

In particular, we highlight the unique approaches and experiences of QuantStack and Probabl, which primarily contribute to multi-stakeholder scientific projects such as scikit-learn, Jupyter, Apache Arrow, or conda-forge.

Gaston Berger
10:05
30min
State of Parquet 2025: Structure, Optimizations, and Recent Innovations
Rok Mihevc, Raúl Cumplido

If you have worked with large amounts of tabular data, chances are you have dealt with Parquet files. Apache Parquet is an open source, column-oriented data file format designed for efficient storage and retrieval. It employs high-performance compression and encoding schemes to handle complex data at scale and is supported in many programming languages and analytics tools.
This talk will give a technical overview of the Parquet file format's structure, explain how data is represented and stored in Parquet, and show why and how some of the available configuration options might better match your specific use case.

We will also highlight some recent developments and discussions in the Parquet community, including Hugging Face's proposed content-defined chunking, an approach that reduces required storage space by ten percent on realistic training datasets. We will also examine the geometry and geography types added to the Parquet specification in 2025, which enable efficient storage of spatial data and have catalyzed Parquet's growing adoption within the geospatial community.

Louis Armand 2 - Ouest
10:35
15min
Break
Gaston Berger
10:35
15min
Break
Louis Armand 1 - Est
10:35
15min
Break
Louis Armand 2 - Ouest
10:50
30min
A Hitchhiker's Guide to the Array API Standard Ecosystem
Lucas Colley

The array API standard is unifying the ecosystem of Python array computing, facilitating greater interoperability between code written for different array libraries, including NumPy, CuPy, PyTorch, JAX, and Dask.

But what are all of these "array-api-" libraries for? How can you use these libraries to 'future-proof' your libraries, and provide support for GPU and distributed arrays to your users? Find out in this talk, where I'll guide you through every corner of the array API standard ecosystem, explaining how SciPy and scikit-learn are using all of these tools to adopt the standard. I'll also be sharing progress updates from the past year, to give you a clear picture of where we are now, and what the future holds.
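A toy sketch of what array-API-agnostic code looks like (not from the talk; real code would use `array_namespace()` from array-api-compat, while this `getattr` fallback merely keeps the sketch running on NumPy arrays that predate the standard):

```python
import numpy as np

def standardize(x):
    # Ask the array for the namespace of its own library (NumPy, CuPy,
    # PyTorch, ...), then do all the math through that namespace.
    xp = getattr(x, "__array_namespace__", lambda: np)()
    return (x - xp.mean(x)) / xp.std(x)

z = standardize(np.array([1.0, 2.0, 3.0]))
print(z)  # [-1.22474487  0.          1.22474487]
```

The same function body then works unchanged for any library implementing the standard, which is the "future-proofing" the talk refers to.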

Louis Armand 2 - Ouest
10:50
30min
Collaborative GIS editing in JupyterLab
Arjun Verma, Martin Renou

JupyterGIS facilitates collaborative editing of GIS files, including the QGIS format, through a web-based interface built on JupyterLab. It also provides a programmatic interface tailored for Jupyter notebooks, making use of the advanced capabilities of the Jupyter rich display system.

In this presentation, we will first provide a high-level overview of the project’s main features.

We will then explore the latest developments, including the integration with the xarray stack and the Pangeo ecosystem, and the support for STAC geographical asset catalogs.

We conclude the talk with a forward-looking presentation of the ongoing development, such as the story maps feature, and the integration with the R programming language.

Louis Armand 1 - Est
10:50
30min
The new lockfile format introduced in PEP 751
Nico Albers

In March 2025, PEP 751 was accepted, proposing a new format for how lockfiles should be structured. The talk will give a brief history of this PEP (and its rejected predecessor), introduce you to the proposed pylock.toml format, and discuss (subjective) highlights of this PEP. Afterwards, a practical example of how this PEP could improve managing your environments will be discussed.

Gaston Berger
11:25
30min
Advanced Polars: Lazy Queries and Streaming Mode
Emanuele Fabbiani

Do you find yourself struggling with Pandas' limitations when handling massive datasets or real-time data streams?

Discover Polars, the lightning-fast DataFrame library built in Rust. This talk presents two advanced features of the next-generation dataframe library: lazy queries and streaming mode.

Lazy evaluation in Polars allows you to build complex data pipelines without the performance bottlenecks of eager execution. By deferring computation, Polars optimises your queries using techniques like predicate and projection pushdown, reducing unnecessary computations and memory overhead. This leads to significant performance improvements, particularly with datasets larger than your system’s physical memory.

Polars' LazyFrames form the foundation of the library’s streaming mode, enabling efficient streaming pipelines, real-time transformations, and seamless integration with various data sinks.

This session will explore use cases and technical implementations of both lazy queries and streaming mode. We’ll also include live-coding demonstrations to introduce the tool, showcase best practices, and highlight common pitfalls.

Attendees will walk away with practical knowledge of lazy queries and streaming mode, ready to apply these tools in their daily work as data engineers or data scientists.

Louis Armand 2 - Ouest
11:25
30min
Browser-based AI workflows in Jupyter
Jeremy Tuloup, Nicolas Brichet

JupyterLite brings Python and other programming languages to the browser, removing the need for a server. In this talk, we show how to extend it for AI workflows: connecting to remote models, running smaller models locally in the browser, and leveraging lightweight interfaces like a chat to interact with them.

Louis Armand 1 - Est
11:25
30min
Navigating the security compliance maze of an ML service
Uwe L. Korn

While everyone is talking about the m(e/a)ss of bureaucracy, we want to show you hands-on what you could need to be doing to operate an ML service. We will give an overview of things like ISO 27001 certification, the Cyber Resilience Act, and AIBOMs. We want to highlight their impact and intention and give advice on how to integrate them into your development workflow.

This talk is written from a practitioner's perspective and will help you set up your project to make your compliance department happy. It isn't meant as a deep dive into the individual standards.

Gaston Berger
12:00
30min
Expanding Programming Language Support in JupyterLite
Isabel Paredes, Thorsten Beier, Antoine Prouvost, Ian Thomas

JupyterLite is a web-based distribution of JupyterLab that runs entirely in the browser, leveraging WebAssembly builds of language kernels and interpreters.

In this talk, we introduce emscripten-forge, a conda-based software distribution tailored for WebAssembly and the web browser. Emscripten-forge empowers several JupyterLite kernels, including:

  • xeus-Python for Python,
  • xeus-R for R,
  • xeus-Octave for GNU Octave.

These kernels cover some of the most popular languages in scientific computing.

Additionally, emscripten-forge includes builds for various terminal applications, utilized by the Cockle shell emulator to enable the JupyterLite terminal.

Louis Armand 1 - Est
12:00
30min
Sparrow, Pirates of the Apache Arrow
Alexis Placet, Johan Mabille

Sparrow is a lightweight C++20 idiomatic implementation of the Apache Arrow memory specification. Designed for compatibility with the Arrow C data interface, Sparrow enables seamless data exchange with other libraries supporting the Arrow format. It also offers high-level APIs, ensuring interoperability with standard modern C++ algorithms.

Louis Armand 2 - Ouest
12:00
30min
Unlock the full predictive power of your multi-table data
Luc-Aurélien Gauthier, Alexis Bondu

While most machine learning tutorials and challenges focus on single-table datasets, real-world enterprise data is often distributed across multiple tables, such as customer logs, transaction records, or manufacturing logs. In this talk, we address the often-overlooked challenge of building predictive features directly from raw, multi-table data. You will learn how to automate feature engineering using a scalable, supervised, and overfit-resistant approach, grounded in information theory and available as a Python open-source library. The talk is aimed at data scientists and ML engineers working with structured data; basic machine learning knowledge is sufficient to follow.

Gaston Berger
12:30
95min
Lunch
Gaston Berger
12:30
95min
Lunch
Louis Armand 1 - Est
12:30
95min
Lunch
Louis Armand 2 - Ouest
14:05
30min
Fighting against instability: Debian Science at the synchrotron SOLEIL
Emmanuel FARHI

The talk addresses the challenges of maintaining and preserving the sovereignty of data processing tools in synchrotron X-ray experiments. It emphasizes the use of stable packaging systems like Debian-based distributions and fostering collaboration within the scientific community to ensure independence from external services and long-term support for software.

Gaston Berger
14:05
30min
How to make public data more accessible with "baked" data and DuckDB
Chris Kucharczyk

Publicly available data is rarely analysis-ready, hampering researchers, organizations, and the public from easily accessing the information these datasets contain. One way to address this shortcoming is to "bake" the data into a structured format and ship it alongside code that can be used for analysis. For analytical work in particular, DuckDB provides a performant way to query the structured data in a variety of contexts.

This talk will explore the benefits and tradeoffs of this architectural pattern using the design of scipeds, an open-source Python package for analyzing higher-education data in the US, as a case study.

No DuckDB experience required, beginner Python and programming experience recommended. This talk is aimed at data practitioners, especially those who work with public datasets.
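The "baked data" pattern is independent of the engine; as a dependency-free illustration (the talk uses DuckDB, but the stdlib sqlite3 shows the same ship-a-prebuilt-database-with-code idea; table and numbers are invented):

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "degrees.db")

    # "Bake" step (done once by the package author): load cleaned public
    # data into a single database file shipped alongside the code.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE completions (field TEXT, year INTEGER, n INTEGER)")
    con.executemany(
        "INSERT INTO completions VALUES (?, ?, ?)",
        [("physics", 2021, 120), ("physics", 2022, 130), ("cs", 2022, 400)],
    )
    con.commit()
    con.close()

    # User step: the data is analysis-ready, so queries need no cleaning.
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT field, SUM(n) FROM completions GROUP BY field ORDER BY field"
    ).fetchall()
    con.close()

print(rows)  # [('cs', 400), ('physics', 250)]
```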

Louis Armand 2 - Ouest
14:05
30min
Tackling Domain Shift with SKADA: A Hands-On Guide to Domain Adaptation
Théo Gnassounou, Antoine Collas

Domain adaptation addresses the challenge of applying ML models to data that differs from the training distribution—a common issue in real-world applications. SKADA is a new Python library that brings domain adaptation tools to the scikit-learn and PyTorch ecosystems. This talk covers SKADA’s design, its integration with standard ML workflows, and how it helps practitioners build models that generalize better across domains.

Louis Armand 1 - Est
14:40
30min
Modern Web Data Extraction: Techniques, Tools, Legal and Ethical Considerations
Domagoj Marić

To satisfy the need for data in generative and traditional AI, the ability to efficiently extract data from the web has become indispensable for businesses and developers in a rapidly evolving environment. This presentation delves into the methodology and tools of web crawling and web scraping, with an overview of the ethical and legal side of the process, including best practices for crawling politely and efficiently and for using the data without violating privacy or intellectual property laws.

Louis Armand 2 - Ouest
14:40
30min
Optimal Transport in Python: A Practical Introduction with POT
Rémi Flamary

Optimal Transport (OT) is a powerful mathematical framework with applications in machine learning, statistics, and data science. This talk introduces the Python Optimal Transport toolbox (POT), an open-source library designed to efficiently solve OT problems. Attendees will learn the basics of OT, explore real-world use cases, and gain hands-on experience with POT (https://pythonot.github.io/).
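To give a concrete sense of what an OT problem is (not POT itself, which wraps dedicated solvers such as `ot.emd`; this dependency-light sketch poses the same small discrete problem directly as a linear program with SciPy, on made-up histograms):

```python
import numpy as np
from scipy.optimize import linprog

# Discrete OT as a linear program:
# minimize <C, P>  subject to  P @ 1 = a,  P.T @ 1 = b,  P >= 0.
a = np.array([0.5, 0.5])            # source histogram
b = np.array([0.5, 0.5])            # target histogram
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])          # ground cost matrix

n, m = C.shape
A_eq = []
# Row-sum constraints: mass leaving each source point.
for i in range(n):
    row = np.zeros(n * m)
    row[i * m:(i + 1) * m] = 1.0
    A_eq.append(row)
# Column-sum constraints: mass arriving at each target point.
for j in range(m):
    col = np.zeros(n * m)
    col[j::m] = 1.0
    A_eq.append(col)

res = linprog(C.ravel(), A_eq=np.array(A_eq),
              b_eq=np.concatenate([a, b]), bounds=(0, None), method="highs")
plan = res.x.reshape(n, m)
print(plan)     # optimal coupling: keep all mass in place
print(res.fun)  # transport cost 0.0
```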

Louis Armand 1 - Est
14:40
30min
Reproducible software provisioning for high performance computing (HPC) and research software engineering (RSE) using Spack
Martin Lang, Hans Fangohr

In this talk we focus on installing software (stacks) beyond just the Python ecosystem. In the first part of the talk we give an introduction to using the package manager Spack (https://spack.readthedocs.io). In the second part we explain how we use Spack at our institute to manage the software stack on the local HPC.

Gaston Berger
15:15
30min
ActiveTigger: A Collaborative Text Annotation Research Tool for Computational Social Sciences
Emilien SCHULTZ, Etienne Ollion, Paul Girard, Julien Boelaert

The exponential growth of textual data—ranging from social media posts and digital news archives to speech-to-text transcripts—has opened new frontiers for research in the social sciences. Tasks such as stance detection, topic classification, and information extraction have become increasingly common. At the same time, the rapid evolution of Natural Language Processing, especially pretrained language models and generative AI, has largely been led by the computer science community, often leaving a gap in accessibility for social scientists.

To address this, we began developing ActiveTigger in 2023: a lightweight, open-source Python application (with a web frontend in React) designed to accelerate the annotation process and manage large-scale datasets through the integration of fine-tuned models. It aims to support computational social science for a broad public both within and outside the social sciences. Already used by a dynamic community in the social sciences, the stable version is planned for early June 2025.

From a more technical perspective, the API is designed to manage the complete workflow, from project creation, embeddings computation, exploration of the text corpus, human annotation with active learning, fine-tuning of pre-trained (BERT-like) models, and prediction on a larger corpus, to export. It also integrates LLM-as-a-service capabilities for prompt-based annotation and information extraction, offering a flexible approach for hybrid manual/automatic labeling. Accessible both through a web frontend and a Python client, ActiveTigger encourages customization and adaptation to specific research contexts and practices.

In this talk, we will delve into the motivations behind the creation of ActiveTigger, outline its technical architecture, and walk through its core functionalities. Drawing on several ongoing research projects within the Computational Social Science (CSS) group at CREST, we will illustrate concrete use cases where ActiveTigger has accelerated data annotation, enabled scalable workflows, and fostered collaborations. Beyond the technical demonstration, the talk will also open a broader reflection on the challenges and opportunities brought by generative AI in academic research—especially in terms of reliability, transparency, and methodological adaptation for qualitative and quantitative inquiries.

The project repository: https://github.com/emilienschultz/activetigger/

The development of this software is funded by the DRARI Ile-de-France and supported by Progédo.

Gaston Berger
15:15
30min
Code as Data: A Practical Introduction to Python’s Abstract Syntax Tree
Laurent Direr

Peek under the hood of Python and unlock the power of its Abstract Syntax Tree! We'll demystify the AST and explore how it powers tools like pytest, linters, and refactoring - as well as some of your favorite data libraries.
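A small taste of the stdlib `ast` module the talk is built around (illustrative snippet, not from the talk):

```python
import ast

source = """
def add(a, b):
    return a + b

def mul(a, b):
    return a * b
"""

tree = ast.parse(source)

# Walk the tree and collect every function definition: the same kind of
# traversal that linters and pytest's assertion rewriting are built on.
names = [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
print(names)  # ['add', 'mul']

# The tree is data you can edit and turn back into source code.
tree.body[0].name = "plus"
print(ast.unparse(tree.body[0]).splitlines()[0])  # def plus(a, b):
```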

Louis Armand 1 - Est
15:15
30min
You Don’t Need Spark for That: Pythonic Data Lakehouse Workflows
Romain Clement

Have you ever spun up a Spark cluster just to update three rows in a Delta table? In this talk, we’ll explore how modern Python libraries can power lightweight, production-grade Data Lakehouse workflows—helping you avoid over-engineering your data stack.

Louis Armand 2 - Ouest
15:45
15min
Break
Gaston Berger
15:45
15min
Break
Louis Armand 1 - Est
15:45
15min
Break
Louis Armand 2 - Ouest
16:00
45min
Big ideas shaping scientific Python: the quest for performance and usability
Ralf Gommers

Behind every technical leap in scientific Python lies a human ecosystem of volunteers, companies, and institutions working in tension and collaboration. This keynote explores how innovation actually happens in open source, through the lens of recent and ongoing initiatives that aim to move the needle on performance and usability - from the ideas that went into NumPy 2.0 and its relatively smooth rollout to the ongoing efforts to leverage the performance GPUs offer without sacrificing maintainability and usability.

Takeaways for the audience: Whether you’re an ML engineer tired of debugging GPU-CPU inconsistencies, a researcher pushing Python to its limits, or an open-source maintainer seeking sustainable funding, this keynote will equip you with both practical solutions and a clear vision of where scientific Python is headed next.

Gaston Berger
16:45
60min
Lightning Talks
Gaston Berger
08:00
60min
Registration & Welcome Coffee
Gaston Berger
09:00
15min
Forewords
Gaston Berger
09:15
45min
Building Data Science Tools for Sustainable Transformation
Anita Graser

The current AI hype, driven by generative AI and particularly large language models, is creating excitement, fear, and inflated expectations. In this keynote, we'll explore geographic & mobility data science tools (such as GeoPandas and MovingPandas) to transform this hype into sustainable and positive development that empowers users.

Gaston Berger
10:00
5min
Room change
Gaston Berger
10:05
30min
Balancing Privacy and Utility: Efficient PII Detection and Replacement in Textual Data
Elizaveta Clouet, Justine BEL-LETOILE

Anonymizing free-text data is harder than it seems. While structured databases have well-established anonymization techniques, textual data — like invoices, resumes, or medical records — poses unique challenges. Personally identifiable information (PII) can appear anywhere, in unpredictable formats, raising the question: how do you modify it while preserving the dataset's usefulness?

Let's explore a practical, open-source 2-step approach to text anonymization: (1) detecting PII using NER models and (2) replacing it while preserving key dataset characteristics (e.g. document formatting, statistical distributions). We will demonstrate how to build a robust pipeline leveraging tools such as pre-trained PII detection models, gliner for fine-tuning, or Faker for generating meaningful replacements.
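A deliberately tiny stdlib sketch of the two-step idea (the talk uses NER models and Faker; here a regex stands in for detection of one easy PII class, and a format-preserving fake address stands in for replacement):

```python
import re

# Toy stand-in for step (1): detect one easy PII class (email addresses).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def anonymize(text: str) -> str:
    # Step (2): replace while preserving the surface format, so that
    # downstream parsing of the document does not break.
    counter = iter(range(1, 10**6))
    return EMAIL.sub(lambda m: f"user{next(counter)}@example.org", text)

doc = "Contact alice@corp.fr or bob.smith@lab.io for the invoice."
print(anonymize(doc))
# Contact user1@example.org or user2@example.org for the invoice.
```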

Ideal for those with a basic understanding of NLP, this session offers practical insights for anyone working with sensitive textual data.

Gaston Berger
10:05
30min
Sharing computational course material at larger scale: a French multi-tenant attempt
Nicolas M. Thiéry

With the rise of computation and data as pillars of science, institutions are struggling to provide large-scale training to their students and staff. Often, this leads to redundant, fragmented efforts, with each organization producing its own bespoke training material. In this talk, we report on a collaborative multi-tenant initiative to produce a shared corpus of interactive training resources in the Python language, designed as a digital common that can be adapted to diverse contexts and formats in French higher education and beyond.

Louis Armand 2 - Ouest
10:05
30min
Skrub: machine learning for dataframes
Riccardo Cappuzzo, Jérôme Dockès, Guillaume Lemaitre

Machine-learning algorithms expect a numeric array with one row per observation. Typically, creating this table requires "wrangling" with Pandas or Polars (aggregations, selections, joins, ...), and to extract numeric features from structured data types such as datetimes. These transformations must be applied consistently when making predictions for unseen inputs, and choices must be informed by performance measured on a validation dataset, while preventing data leakage. This preprocessing is the most difficult and time-consuming part of many data-science projects.

Skrub bridges the gap between complex tabular data stored in Pandas or Polars dataframes, and machine-learning algorithms implemented by scikit-learn estimators. It provides scikit-learn transformers to extract features from datetimes, (fuzzy) categories and text, and to perform data-wrangling such as joins and aggregations in a learning pipeline. Its pre-built, flexible learners offer very robust performance on many tabular datasets without manual tweaking. It can create complex pipelines that handle multiple tables, while easily describing and searching rich hyperparameter spaces. As interactivity and visualization are essential for preprocessing, Skrub also provides an interactive report to explore a dataframe, and its pipelines can be built incrementally while inspecting intermediate results.

Louis Armand 1 - Est
10:35
15min
Break
Gaston Berger
10:35
15min
Break
Louis Armand 1 - Est
10:35
15min
Break
Louis Armand 2 - Ouest
10:50
30min
Meta-Dashboards: Accelerating Geospatial Web Apps Creation with Voilà
Davide De Marchi

The Joint Research Centre has cultivated significant expertise in developing Voilà dashboards for scientific data visualization, resulting in the design and deployment of many real-world web applications. This presentation will highlight our commitment to building a robust Voilà developer community through dedicated training and resource libraries. We will introduce and demonstrate our innovative meta-dashboards, which streamline the creation of complex, multi-page dashboards by automating framework and code generation. A live demonstration will illustrate the ease of building a geospatial application using this tool. We will conclude with a showcase of recently developed Voilà dashboards in areas such as agricultural/biodiversity surveys and air quality monitoring, demonstrating their effectiveness in data exploration and validation.

Louis Armand 2 - Ouest
10:50
30min
Move beyond academia: Introducing an industry-first tabular benchmark
Alexandre Abraham

Discover a new benchmark designed for real-world impact. Built on authentic private-company data and carefully chosen public datasets that reflect real industry challenges, like product categorization, basket prediction, and personalized recommendations, it offers a realistic testing ground for both classic baselines (e.g., gradient boosting) and the latest models such as CARTE, TabICL, and TabPFN. By bridging the gap between academic research and industrial needs, this benchmark brings model evaluation closer to the decisions and constraints faced in practice.

This shift has tangible consequences: models are tested on problems that matter to businesses, using metrics that reflect real-world priorities (e.g., Precision@K, Recall@K, MAP@K). It enables more relevant model selection, highlights where academic approaches fall short, and fosters solutions that are not just novel but deployable. Models are judged on tasks and metrics that matter, enabling more informed choices, exposing the limits of lab-only approaches, and helping accelerate the journey from innovation to deployment.

Gaston Berger
10:50
30min
PyPI in the face: running jokes that PyPI download stats can play on you
Loïc Estève

We all love to tell stories with data and we all love to listen to them. Wouldn't it be great if we could also draw actionable insights from these nice stories?

As scikit-learn maintainers, we would love to use PyPI download stats and other proxy metrics (website analytics, GitHub repository statistics, etc.) to help inform some of our decisions, like:
- how do we increase user awareness of best practices (please use Pipeline and cross-validation)?
- how do we advertise our recent improvements (use HistGradientBoosting rather than GradientBoosting, TunedThresholdClassifier, PCA and a few other models can run on GPU)?
- do users care more about new features from recent releases or consolidation of what already exists?
- how long should we support older versions of Python, numpy or scipy?

In this talk we will highlight a number of lessons learned while trying to understand the complex reality behind these seemingly simple metrics.

Telling nice stories is not always hard; trying to grasp the reality behind these metrics is often tricky.

Louis Armand 1 - Est
11:25
30min
Beyond Prototyping: Building Production-Level Apps with Streamlit
Johannes Rieke

Streamlit is a great tool for prototyping data apps, but is it also fit for complex, production-level apps? In this talk, the Streamlit team will showcase new features, LLM integrations, and deployment options that can help you effectively use Streamlit in your company, whether it’s a small startup or a large enterprise.

Louis Armand 2 - Ouest
11:25
30min
CodeCommons: Towards transparent, richer and sustainable datasets for code generation model training
Simeon Carstens

Built on top of Software Heritage - the largest public archive of source code - the CodeCommons collaboration is building a large-scale, metadata-rich source code dataset designed to make training AI models on code more transparent, sustainable, and fair. Code will be enriched with contextual information such as issues, pull request discussions, licensing data, and provenance.
In this presentation, we will outline the goals and structure of both the Software Heritage and CodeCommons projects, and discuss our particular contribution to CodeCommons' big data infrastructure.

Louis Armand 1 - Est
11:25
30min
Enhancing Machine Learning Workflows with skore
Marie Sacksick

Discover how skore, a new-born open-source Python library, can elevate your machine learning projects by integrating recommended practices and avoiding common pitfalls. This talk will introduce skore's key features and demonstrate how it can streamline your model evaluation and diagnostics processes.

Gaston Berger
12:00
30min
A Journey Through a Geospatial Data Pipeline: From Raw Coordinates to Actionable Insights
Florent Gravin

Every dataset has a story — and when it comes to geospatial data, it’s a story deeply rooted in space and scale. But working with geospatial information is often a hidden challenge: massive file sizes, strange formats, projections, and pipelines that don't scale easily.

In this talk, we'll follow the life of a real-world geospatial dataset, from its raw collection in the field to its transformation into meaningful insights. Along the way, we’ll uncover the key steps of building a robust, scalable open-source geospatial pipeline.

Drawing on years of experience at Camptocamp, we’ll explore:

  • How raw spatial data is ingested and cleaned
  • How vector and raster data are efficiently stored and indexed (PostGIS, Cloud Optimized GeoTIFFs, Zarr)
  • How modern tools like Dask, GeoServer, and STAC (SpatioTemporal Asset Catalogs) help process and serve geospatial data
  • How to design pipelines that handle both "small data" (local shapefiles) and "big data" (terabytes of satellite imagery)
  • Common pitfalls and how to avoid them when moving from prototypes to production

This journey will show how the open-source ecosystem has matured to make geospatial big data accessible — and how spatial thinking can enrich almost any data project, whether you are building dashboards, doing analytics, or setting the stage for machine learning later on.

Louis Armand 2 - Ouest
12:00
30min
How to do real TDD in data science? A journey from pandas to polars with pelage!
Alix Tiran-Cappello

In the world of data, inconsistencies and inaccuracies often present a major challenge to extracting valuable insights. Yet the number of robust tools and practices to address those issues remains limited. In particular, the practice of TDD remains quite difficult in data science, while it is a standard in classic software development, partly because of poorly adapted tools and frameworks.

To address this issue we released Pelage, an open-source Python package to facilitate data exploration and testing, which relies on Polars' intuitive syntax and speed. Pelage helps data scientists and analysts streamline data transformations, enhance data quality, and improve code clarity.

We will demonstrate, in a test-first approach, how you can use this library in a meaningful data science workflow to gain greater confidence for your data transformations.

See website: https://alixtc.github.io/pelage/

Louis Armand 1 - Est
12:00
30min
Probabilistic regression models: let's compare different modeling strategies and discuss how to evaluate them
Olivier Grisel

Most common machine learning models (linear, tree-based or neural network-based) optimize the least-squares loss when trained for regression tasks. As a result, they output a point estimate of the conditional expected value of the target: E[y|X].

In this presentation, we will explore several ways to train and evaluate probabilistic regression models as a richer alternative to point estimates. Those models predict a richer description of the full distribution of y|X and allow us to quantify the predictive uncertainty for individual predictions.

On the model training part, we will introduce the following options:

  • an ensemble of quantile regressors for a grid of quantile levels (using linear models or gradient boosted trees in scikit-learn, XGBoost and PyTorch);
  • reducing probabilistic regression to multi-class classification, plus a cumulative sum of the predict_proba output to recover a continuous conditional CDF;
  • implementing this approach as a generic scikit-learn meta-estimator;
  • how this approach is used to pretrain foundational tabular models (e.g. TabPFNv2);
  • simple Bayesian models (e.g. Bayesian Ridge and Gaussian Processes);
  • more specialized approaches as implemented in XGBoostLSS.
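The classification-reduction idea listed above fits in a few lines of NumPy (illustrative numbers, not from the talk): given predicted class probabilities over ordered target bins, a cumulative sum yields a step-wise conditional CDF, which can then be inverted for quantiles.

```python
import numpy as np

# Hypothetical predict_proba output for one sample, where the continuous
# target was discretized into 5 ordered bins.
bin_edges = np.array([0.0, 10.0, 20.0, 30.0, 40.0, 50.0])
proba = np.array([0.1, 0.2, 0.4, 0.2, 0.1])  # sums to 1

# Cumulative sum over ordered bins recovers a step-wise conditional CDF.
cdf = np.cumsum(proba)
print(cdf)  # [0.1 0.3 0.7 0.9 1. ]

# Quantiles follow by inverting the CDF, e.g. the bin holding the median:
median_bin = np.searchsorted(cdf, 0.5)
print(bin_edges[median_bin], bin_edges[median_bin + 1])  # 20.0 30.0
```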

We will also discuss how to evaluate probabilistic predictions via:

  • the pinball loss of quantile regressors,
  • other strictly proper scoring rules such as Continuous Ranked Probability Score (CRPS),
  • coverage measures and width of prediction intervals,
  • reliability diagrams for different quantile levels.
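As a reference point, the pinball loss for a single quantile level fits in a few lines of plain Python (an illustrative sketch; scikit-learn ships an equivalent mean_pinball_loss metric):

```python
def pinball_loss(y_true, y_pred, quantile):
    """Average pinball (quantile) loss at one quantile level.

    Per observation: quantile * (y - q) if y >= q, else
    (quantile - 1) * (y - q). Lower is better; in expectation the
    loss is minimized by the true conditional quantile, which is
    what makes it a strictly proper scoring rule for quantiles.
    """
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += quantile * diff if diff >= 0 else (quantile - 1) * diff
    return total / len(y_true)


# At quantile=0.5 the pinball loss is half the mean absolute error:
print(pinball_loss([1.0, 3.0], [2.0, 2.0], 0.5))  # 0.5
```

The asymmetry is the point: at quantile 0.9, under-predictions are penalized nine times more than over-predictions, pushing the regressor toward the 90th percentile.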

We will illustrate these concepts with concrete examples and running code.

Finally, we will illustrate why some applications need such calibrated probabilistic predictions:

  • estimating uncertainty in trip times depending on traffic conditions, to help a human decision-maker choose among travel plan options;
  • modeling value at risk for investment decisions;
  • assessing the impact of missing variables for an ML model trained to work in degraded mode;
  • Bayesian optimization of operational parameters of industrial machines from few, costly observations.

If time allows, we will also discuss the usage and limitations of Conformal Quantile Regressors as implemented in MAPIE, and contrast the aleatoric vs epistemic uncertainty captured by those models.

Gaston Berger
12:30
12:30
95min
Lunch
Gaston Berger
12:30
95min
Lunch
Louis Armand 1 - Est
12:30
95min
Lunch
Louis Armand 2 - Ouest
14:05
14:05
30min
Applying Causal Inference in Industry 4.0: A Case Study from Glasswool Production
Simona Bottani, Patrick Lee

Causal inference offers a principled way to estimate the effects of interventions—a critical need in industrial settings where decisions directly impact costs and performance. This talk presents a case study from Saint-Gobain, in collaboration with Inria, where we applied causal inference methods to production and quality data to reduce raw material usage without compromising product quality.
We’ll walk through each step of a causal analysis: building a causal graph in collaboration with domain experts, identifying confounders, working with continuous treatments, and using open-source tools such as DoWhy, EconML, and DAGitty. The talk is aimed at data scientists with basic ML experience, looking to apply causal thinking to real-world, non-academic problems.

Louis Armand 2 - Ouest
14:05
30min
Documents Meet LLMs: Tales from the Trenches
Nour El Mawass, Miklos Erdelyi

Processing documents with LLMs comes with unexpected challenges: handling long inputs, enforcing structured outputs, catching hallucinations, and recovering from partial failures.
In this talk, we'll cover why large context windows are not a silver bullet, why chunking is deceptively hard, and how to design inputs and outputs that allow for intelligent retries. We'll also share practical prompting strategies, discuss OCR and parsing tools, compare different LLMs (and their cloud APIs), and highlight real-world insights from our experience developing production GenAI applications across multiple document-processing scenarios.

Gaston Berger
14:05
30min
Machine Learning in the Browser: Fast Iteration with ONNX & WebAssembly
Romain Clement

Deploying ML models doesn’t have to mean spinning up servers and writing backend code. This talk shows how to run machine learning inference directly in the browser—using ONNX and WebAssembly—to go from prototype to interactive demo in minutes, not weeks.

Louis Armand 1 - Est
14:40
14:40
30min
Advancements in optimizing ML Inference at CERN
Sanjiban Sengupta

At CERN—the European Organization for Nuclear Research—machine learning is applied across a wide range of scenarios, from simulations and event reconstruction to classifying interesting experimental events, all while handling data rates in the order of terabytes per second. As a result, beyond developing complex models, CERN also requires highly optimized mechanisms for model inference.

In the ML4EP team at CERN, we have developed SOFIE (System for Optimized Fast Inference code Emit), an open-source tool designed for fast inference on ML models with minimal dependencies and low latency. SOFIE is under active development, driven by feedback not only from high-energy physics researchers but also from the broader scientific community.

With upcoming upgrades to CERN’s experiments expected to increase data generation, we have been investigating optimization methods to make SOFIE even more efficient in terms of time and memory usage, while improving its accessibility and ease of integration with other software stacks.

In this talk, we will introduce SOFIE and present novel optimization strategies developed to accelerate ML inference and reduce resource overhead.

Louis Armand 2 - Ouest
14:40
30min
Build a data studio in your notebook with jupyter-fs
Tim Paine

jupyter-fs provides an interface between PyFilesystem and fsspec file systems, the JupyterLab user interface, and the Jupyter notebooks you run. Connect and browse your local filesystem, S3, Samba, WebDAV, and more, interacting with data seamlessly from both the JupyterLab UI and your notebook's kernel.

Louis Armand 1 - Est
14:40
30min
Repetita Non Iuvant: Why Generative AI Models Cannot Feed Themselves
Valeria Zuccoli

As AI floods the digital landscape with content, what happens when it starts repeating itself?
This talk explores model collapse, a progressive erosion where LLMs and image generators loop on their own results, hindering the creation of novel output.

We will show how self-training leads to bias and loss of diversity, examine the causes of this degradation, and quantify its impact on model creativity.
Finally, we will also present concrete strategies to safeguard the future of generative AI, emphasizing the critical need to preserve innovation and originality.

By the end of this talk, attendees will gain insights into the practical implications of model collapse, understanding its impact on content diversity and the long-term viability of AI.

Gaston Berger
15:15
15:15
30min
Building Resilient (ML) Pipelines for MLOps
Lex Avstreikh

This talk explores the disconnect between MLOps fundamental principles and their practical application in designing, operating and maintaining machine learning pipelines. We’ll break down these principles, examine their influence on pipeline architecture, and conclude with a straightforward, vendor-agnostic mind-map, offering a roadmap to build resilient MLOps systems for any project or technology stack.
Despite the surge in tools and platforms, many teams still struggle with the same underlying issues: brittle data dependencies, poor observability, unclear ownership, and pipelines that silently break once deployed. Architecture alone isn't the answer — systems thinking is.

We'll use concrete examples to walk through common failure modes in ML pipelines, highlight where analogies fall apart, and show how to build systems that tolerate failure, adapt to change, and support iteration without regressions.

Topics covered include:
- Common failure modes in ML pipelines
- Modular design: feature, training, inference
- Built-in observability, versioning, reuse
- Orchestration across batch, real-time, LLMs
- Platform-agnostic patterns that scale

Key takeaways:
- Resilience > diagrams
- Separate concerns, embrace change
- Metadata is your backbone
- Infra should support iteration, not block it

Louis Armand 2 - Ouest
15:15
30min
From Language to Knowledge: How SpaCy Can Build Better AI Models
Anushka Narula

Natural language processing (NLP) models are great at recognizing patterns, but they often fail to understand context and meaning. In this talk, I’ll show how to combine SpaCy’s NLP capabilities with knowledge-based AI (KBAI) to build smarter, context-aware models that improve accuracy and reasoning.

Gaston Berger
15:15
30min
xeus-cpp, the new C++ kernel for Jupyter.
Johan Mabille, Anutosh Bhat

xeus-cpp is the next-generation Jupyter kernel for C++, replacing the outdated xeus-cling. It supports recent versions of the language, comes with new features, can be extended, and even provides a JupyterLite kernel.

Louis Armand 1 - Est
15:45
15:45
15min
Break
Gaston Berger
15:45
15min
Break
Louis Armand 1 - Est
15:45
15min
Break
Louis Armand 2 - Ouest
16:00
16:00
30min
Architecting Scalable Multi-Modal Video Search
Irene Donato

The exponential growth of video data presents significant challenges for effective content discovery. Traditional keyword search falls short when dealing with visual nuances. This talk addresses the design and implementation of a robust system for large-scale, multi-modal video retrieval, enabling search across petabytes of data using diverse inputs like text descriptions (e.g., appearance, actions) and query images (e.g., faces). We will explore an architecture combining efficient batch preprocessing for feature extraction (including person detection, face/CLIP-style embeddings) with optimized vector database indexing. Attendees will learn about strategies for managing massive datasets, optimizing ML inference pipelines for speed and cost-efficiency (touching upon lightweight models and specialized runtimes), and building interactive systems that bridge pre-computed indexes with real-time analysis capabilities for enhanced insights.

Gaston Berger
16:00
30min
CoSApp: an open-source library to design complex systems
Étienne Lac

CoSApp, for Collaborative System Approach, is a Python library dedicated to the simulation and design of multi-disciplinary systems. It is primarily intended for engineers and system architects during the early stage of industrial product design. The API of CoSApp is focused on simplicity and explicit declaration of design problems. Special attention is given to modularity; a very flexible mechanism of solver assembly allows users to construct complex, customized simulation workflows.
This presentation introduces the key features of the framework.

https://cosapp.readthedocs.io
https://gitlab.com/cosapp/cosapp

Louis Armand 2 - Ouest
16:00
30min
Parallel processing using CRDTs
David Brochart

Beyond embarrassingly parallel processing problems, data must be shared between workers for them to do something useful. This can be done by:
- sharing memory between threads, with the issue of preventing access to shared data to avoid race conditions.
- copying memory to subprocesses, with the challenge of synchronizing data whenever it is mutated.

In Python, using threads is not an option because of the GIL (global interpreter lock), which prevents true parallelism. This might change in the future with the removal of the GIL, but the usual problems of multithreading will then appear, such as using locks and managing their complexity. Subprocesses don't suffer from the GIL, but usually need to access a database to share data, which is often too slow. Data structures such as HAMTs (hash array mapped tries) have been used to efficiently and safely share data stored in immutable structures, removing the need for locks. In this talk we will show how CRDTs (conflict-free replicated data types) can be used for the same purpose.
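To make the idea concrete (this example is ours, not taken from the talk), the simplest state-based CRDT is a grow-only counter: each worker only ever increments its own slot, so replicas can always be merged without locks.

```python
class GCounter:
    """Grow-only counter: each worker increments only its own slot,
    so concurrent replicas merge with an element-wise max."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.counts = {}  # worker_id -> that worker's increment count

    def increment(self, n=1):
        self.counts[self.worker_id] = self.counts.get(self.worker_id, 0) + n

    def merge(self, other):
        # Merge is commutative, associative and idempotent, so replicas
        # converge regardless of the order in which updates arrive.
        for wid, c in other.counts.items():
            self.counts[wid] = max(self.counts.get(wid, 0), c)

    def value(self):
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(3)  # workers update their replicas concurrently...
b.increment(2)
a.merge(b)      # ...then exchange state in any order
b.merge(a)
print(a.value(), b.value())  # 5 5
```

Richer CRDTs (maps, lists, text) follow the same contract: a merge that is commutative, associative and idempotent, which is exactly what removes the need for coordination between workers.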

Louis Armand 1 - Est
16:30
16:30
60min
Lightning Talks
Gaston Berger
17:30
17:30
15min
Farewell
Gaston Berger
09:30
09:30
450min
Project Sprints
Carrefour Numérique