08:00
60min
Registration & Welcome Coffee
Gaston Berger
09:00
15min
Opening session
Gaston Berger
09:15
45min
You Don’t Have to Be an Expert: Stories from the Open Source Frontlines
Alenka Frim

Four years ago, I had no idea what PyArrow was—or how open source development worked. But through mentorship, collaboration, and learning in public, I found not just a place in the community, but a sense of how open source evolves and connects.

In this keynote, I’ll share my experience on how complex projects like Apache Arrow evolve through shared protocols, cross-project conversations, and the people behind them. Along the way, we’ll look at the human side of technical work, the quiet strength of standards, and how imposter syndrome, while uncomfortable, has sharpened my curiosity and helped me find my own way of contributing.

Gaston Berger
10:00
5min
Room change
Gaston Berger
10:05
30min
From Jupyter Notebook to Publish-Ready Report: Effortless Sharing with Quarto
Christophe Dervieux

See how Quarto can transform your Jupyter notebooks into stakeholder-ready web pages or PDFs, published online with just one command. This session features practical demonstrations of publishing with quarto publish, applying custom styles tailored to your organization thanks to brand.yml, and leveraging new features for reproducible research.

Designed for anyone looking to share their work, this talk requires only basic Python and notebook familiarity. You’ll walk away with the skills to elevate your reporting workflow and share insights professionally.

Louis Armand 1 - Est
10:05
30min
Open-source Business
Sylvain Corlay, Yann Lechelle

Challenges in economics and governance models for open-source scientific projects

In this presentation, the CEOs of two companies at the forefront of open-source scientific software development - Sylvain Corlay of QuantStack and Yann Lechelle of Probabl - examine the intricate challenges of open-source funding and governance and reflect on how these two aspects interconnect.

We start by reflecting on the origins of the open-source movement within the scientific community, and delve into the contemporary challenges of operating businesses and identifying sustainable economic models that both leverage and contribute to open-source software.

In particular, we highlight the unique approaches and experiences of QuantStack and Probabl, which primarily contribute to multi-stakeholder scientific projects such as scikit-learn, Jupyter, Apache Arrow, or conda-forge.

Gaston Berger
10:05
30min
State of Parquet 2025: Structure, Optimizations, and Recent Innovations
Rok Mihevc, Raúl Cumplido

If you have worked with large amounts of tabular data, chances are you have dealt with Parquet files. Apache Parquet is an open source, column-oriented data file format designed for efficient storage and retrieval. It employs high-performance compression and encoding schemes to handle complex data at scale and is supported in many programming languages and analytics tools.
This talk will give a technical overview of the Parquet file format's structure, explain how data is represented and stored in Parquet, and show why and how some of the available configuration options might better match your specific use case.

We will also highlight some recent developments and discussions in the Parquet community, including Hugging Face's proposed content-defined chunking, an approach that reduces required storage space by ten percent on realistic training datasets. We will also examine the geometry and geography types added to the Parquet specification in 2025, which enable efficient storage of spatial data and have catalyzed Parquet's growing adoption within the geospatial community.

Louis Armand 2 - Ouest
10:35
15min
Break
Gaston Berger
10:35
15min
Break
Louis Armand 1 - Est
10:35
15min
Break
Louis Armand 2 - Ouest
10:50
30min
A Hitchhiker's Guide to the Array API Standard Ecosystem
Lucas Colley

The array API standard is unifying the ecosystem of Python array computing, facilitating greater interoperability between code written for different array libraries, including NumPy, CuPy, PyTorch, JAX, and Dask.

But what are all of these "array-api-" libraries for? How can you use these libraries to 'future-proof' your libraries, and provide support for GPU and distributed arrays to your users? Find out in this talk, where I'll guide you through every corner of the array API standard ecosystem, explaining how SciPy and scikit-learn are using all of these tools to adopt the standard. I'll also be sharing progress updates from the past year, to give you a clear picture of where we are now, and what the future holds.
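A toy sketch of what array-API-agnostic code looks like (not from the talk; real code would use `array_namespace()` from array-api-compat, while this `getattr` fallback merely keeps the sketch running on NumPy arrays that predate the standard):

```python
import numpy as np

def standardize(x):
    # Ask the array for the namespace of its own library (NumPy, CuPy,
    # PyTorch, ...), then do all the math through that namespace.
    xp = getattr(x, "__array_namespace__", lambda: np)()
    return (x - xp.mean(x)) / xp.std(x)

z = standardize(np.array([1.0, 2.0, 3.0]))
print(z)  # [-1.22474487  0.          1.22474487]
```

The same function body then works unchanged for any library implementing the standard, which is the "future-proofing" the talk refers to.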

Louis Armand 2 - Ouest
10:50
30min
Collaborative GIS editing in JupyterLab
Arjun Verma, Martin Renou

JupyterGIS facilitates collaborative editing of GIS files, including the QGIS format, through a web-based interface built on JupyterLab. It also provides a programmatic interface tailored for Jupyter notebooks, making use of the advanced capabilities of the Jupyter rich display system.

In this presentation, we will first provide a high-level overview of the project’s main features.

We will then explore the latest developments, including the integration with the xarray stack and the Pangeo ecosystem, and the support for STAC geographical asset catalogs.

We conclude the talk with a forward-looking presentation of the ongoing development, such as the story maps feature, and the integration with the R programming language.

Louis Armand 1 - Est
10:50
30min
The new lockfile format introduced in PEP 751
Nico Albers

In March 2025, PEP 751 was accepted, proposing a new format for how lockfiles should be structured. The talk will give a brief history of this PEP (and its rejected predecessor), introduce you to the proposed pylock.toml format, and discuss (subjective) highlights of this PEP. Afterwards, a practical example of how this PEP could improve managing your environments will be discussed.

Gaston Berger
11:25
30min
Advanced Polars: Lazy Queries and Streaming Mode
Emanuele Fabbiani

Do you find yourself struggling with Pandas' limitations when handling massive datasets or real-time data streams?

Discover Polars, the lightning-fast DataFrame library built in Rust. This talk presents two advanced features of the next-generation dataframe library: lazy queries and streaming mode.

Lazy evaluation in Polars allows you to build complex data pipelines without the performance bottlenecks of eager execution. By deferring computation, Polars optimises your queries using techniques like predicate and projection pushdown, reducing unnecessary computations and memory overhead. This leads to significant performance improvements, particularly with datasets larger than your system’s physical memory.

Polars' LazyFrames form the foundation of the library’s streaming mode, enabling efficient streaming pipelines, real-time transformations, and seamless integration with various data sinks.

This session will explore use cases and technical implementations of both lazy queries and streaming mode. We’ll also include live-coding demonstrations to introduce the tool, showcase best practices, and highlight common pitfalls.

Attendees will walk away with practical knowledge of lazy queries and streaming mode, ready to apply these tools in their daily work as data engineers or data scientists.

Louis Armand 2 - Ouest
11:25
30min
Browser-based AI workflows in Jupyter
Jeremy Tuloup, Nicolas Brichet

JupyterLite brings Python and other programming languages to the browser, removing the need for a server. In this talk, we show how to extend it for AI workflows: connecting to remote models, running smaller models locally in the browser, and leveraging lightweight interfaces like a chat to interact with them.

Louis Armand 1 - Est
11:25
30min
Navigating the security compliance maze of an ML service
Uwe L. Korn

While everyone is talking about the m(e/a)ss of bureaucracy, we want to show you hands-on what you could need to be doing to operate an ML service. We will give an overview of things like ISO 27001 certification, the Cyber Resilience Act, and AIBOMs. We want to highlight their impact and intention and give advice on how to integrate them into your development workflow.

This talk is written from a practitioner's perspective and will help you set up your project to make your compliance department happy. It isn't meant as a deep dive into the individual standards.

Gaston Berger
12:00
30min
Expanding Programming Language Support in JupyterLite
Isabel Paredes, Thorsten Beier, Antoine Prouvost, Ian Thomas

JupyterLite is a web-based distribution of JupyterLab that runs entirely in the browser, leveraging WebAssembly builds of language kernels and interpreters.

In this talk, we introduce emscripten-forge, a conda-based software distribution tailored for WebAssembly and the web browser. Emscripten-forge empowers several JupyterLite kernels, including:

  • xeus-Python for Python,
  • xeus-R for R,
  • xeus-Octave for GNU Octave.

These kernels cover some of the most popular languages in scientific computing.

Additionally, emscripten-forge includes builds for various terminal applications, utilized by the Cockle shell emulator to enable the JupyterLite terminal.

Louis Armand 1 - Est
12:00
30min
Sparrow, Pirates of the Apache Arrow
Alexis Placet, Johan Mabille

Sparrow is a lightweight C++20 idiomatic implementation of the Apache Arrow memory specification. Designed for compatibility with the Arrow C data interface, Sparrow enables seamless data exchange with other libraries supporting the Arrow format. It also offers high-level APIs, ensuring interoperability with standard modern C++ algorithms.

Louis Armand 2 - Ouest
12:00
30min
Unlock the full predictive power of your multi-table data
Luc-Aurélien Gauthier, Alexis Bondu

While most machine learning tutorials and challenges focus on single-table datasets, real-world enterprise data is often distributed across multiple tables, such as customer logs, transaction records, or manufacturing logs. In this talk, we address the often-overlooked challenge of building predictive features directly from raw, multi-table data. You will learn how to automate feature engineering using a scalable, supervised, and overfit-resistant approach, grounded in information theory and available as a Python open-source library. The talk is aimed at data scientists and ML engineers working with structured data; basic machine learning knowledge is sufficient to follow.

Gaston Berger
12:30
95min
Lunch
Gaston Berger
12:30
95min
Lunch
Louis Armand 1 - Est
12:30
95min
Lunch
Louis Armand 2 - Ouest
14:05
30min
Fighting against instability: Debian Science at the synchrotron SOLEIL
Emmanuel FARHI

The talk addresses the challenges of maintaining and preserving the sovereignty of data processing tools in synchrotron X-ray experiments. It emphasizes the use of stable packaging systems like Debian-based distributions and fostering collaboration within the scientific community to ensure independence from external services and long-term support for software.

Gaston Berger
14:05
30min
How to make public data more accessible with "baked" data and DuckDB
Chris Kucharczyk

Publicly available data is rarely analysis-ready, hampering researchers, organizations, and the public from easily accessing the information these datasets contain. One way to address this shortcoming is to "bake" the data into a structured format and ship it alongside code that can be used for analysis. For analytical work in particular, DuckDB provides a performant way to query the structured data in a variety of contexts.

This talk will explore the benefits and tradeoffs of this architectural pattern using the design of scipeds, an open-source Python package for analyzing higher-education data in the US, as a case study.

No DuckDB experience required, beginner Python and programming experience recommended. This talk is aimed at data practitioners, especially those who work with public datasets.
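The "baked data" pattern is independent of the engine; as a dependency-free illustration (the talk uses DuckDB, but the stdlib sqlite3 shows the same ship-a-prebuilt-database-with-code idea; table and numbers are invented):

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "degrees.db")

    # "Bake" step (done once by the package author): load cleaned public
    # data into a single database file shipped alongside the code.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE completions (field TEXT, year INTEGER, n INTEGER)")
    con.executemany(
        "INSERT INTO completions VALUES (?, ?, ?)",
        [("physics", 2021, 120), ("physics", 2022, 130), ("cs", 2022, 400)],
    )
    con.commit()
    con.close()

    # User step: the data is analysis-ready, so queries need no cleaning.
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT field, SUM(n) FROM completions GROUP BY field ORDER BY field"
    ).fetchall()
    con.close()

print(rows)  # [('cs', 400), ('physics', 250)]
```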

Louis Armand 2 - Ouest
14:05
30min
Tackling Domain Shift with SKADA: A Hands-On Guide to Domain Adaptation
Théo Gnassounou, Antoine Collas

Domain adaptation addresses the challenge of applying ML models to data that differs from the training distribution—a common issue in real-world applications. SKADA is a new Python library that brings domain adaptation tools to the scikit-learn and PyTorch ecosystems. This talk covers SKADA’s design, its integration with standard ML workflows, and how it helps practitioners build models that generalize better across domains.

Louis Armand 1 - Est
14:40
30min
Modern Web Data Extraction: Techniques, Tools, Legal and Ethical Considerations
Domagoj Marić

To satisfy the need for data in generative and traditional AI, the ability to efficiently extract data from the web has become indispensable for businesses and developers in a rapidly evolving environment. This presentation delves into the methodology and tools of web crawling and web scraping, with an overview of the ethical and legal side of the process, including best practices for crawling politely and efficiently and for using the data without violating privacy or intellectual property laws.

Louis Armand 2 - Ouest
14:40
30min
Optimal Transport in Python: A Practical Introduction with POT
Rémi Flamary

Optimal Transport (OT) is a powerful mathematical framework with applications in machine learning, statistics, and data science. This talk introduces the Python Optimal Transport toolbox (POT), an open-source library designed to efficiently solve OT problems. Attendees will learn the basics of OT, explore real-world use cases, and gain hands-on experience with POT (https://pythonot.github.io/).
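To give a concrete sense of what an OT problem is (not POT itself, which wraps dedicated solvers such as `ot.emd`; this dependency-light sketch poses the same small discrete problem directly as a linear program with SciPy, on made-up histograms):

```python
import numpy as np
from scipy.optimize import linprog

# Discrete OT as a linear program:
# minimize <C, P>  subject to  P @ 1 = a,  P.T @ 1 = b,  P >= 0.
a = np.array([0.5, 0.5])            # source histogram
b = np.array([0.5, 0.5])            # target histogram
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])          # ground cost matrix

n, m = C.shape
A_eq = []
# Row-sum constraints: mass leaving each source point.
for i in range(n):
    row = np.zeros(n * m)
    row[i * m:(i + 1) * m] = 1.0
    A_eq.append(row)
# Column-sum constraints: mass arriving at each target point.
for j in range(m):
    col = np.zeros(n * m)
    col[j::m] = 1.0
    A_eq.append(col)

res = linprog(C.ravel(), A_eq=np.array(A_eq),
              b_eq=np.concatenate([a, b]), bounds=(0, None), method="highs")
plan = res.x.reshape(n, m)
print(plan)     # optimal coupling: keep all mass in place
print(res.fun)  # transport cost 0.0
```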

Louis Armand 1 - Est
14:40
30min
Reproducible software provisioning for high performance computing (HPC) and research software engineering (RSE) using Spack
Martin Lang, Hans Fangohr

In this talk we focus on installing software (stacks) beyond just the Python ecosystem. In the first part of the talk we give an introduction to using the package manager Spack (https://spack.readthedocs.io). In the second part we explain how we use Spack at our institute to manage the software stack on the local HPC.

Gaston Berger
15:15
30min
ActiveTigger: A Collaborative Text Annotation Research Tool for Computational Social Sciences
Emilien SCHULTZ, Etienne Ollion, Paul Girard, Julien Boelaert

The exponential growth of textual data—ranging from social media posts and digital news archives to speech-to-text transcripts—has opened new frontiers for research in the social sciences. Tasks such as stance detection, topic classification, and information extraction have become increasingly common. At the same time, the rapid evolution of Natural Language Processing, especially pretrained language models and generative AI, has largely been led by the computer science community, often leaving a gap in accessibility for social scientists.

To address this, we began developing ActiveTigger in 2023: a lightweight, open-source Python application (with a web frontend in React) designed to accelerate the annotation process and manage large-scale datasets through the integration of fine-tuned models. It aims to support computational social science for a broad public both within and outside the social sciences. Already used by a dynamic community in the social sciences, the stable version is planned for early June 2025.

From a more technical perspective, the API is designed to manage the complete workflow, from project creation, embeddings computation, exploration of the text corpus, human annotation with active learning, fine-tuning of pre-trained (BERT-like) models, and prediction on a larger corpus, to export. It also integrates LLM-as-a-service capabilities for prompt-based annotation and information extraction, offering a flexible approach for hybrid manual/automatic labeling. Accessible both through a web frontend and a Python client, ActiveTigger encourages customization and adaptation to specific research contexts and practices.

In this talk, we will delve into the motivations behind the creation of ActiveTigger, outline its technical architecture, and walk through its core functionalities. Drawing on several ongoing research projects within the Computational Social Science (CSS) group at CREST, we will illustrate concrete use cases where ActiveTigger has accelerated data annotation, enabled scalable workflows, and fostered collaborations. Beyond the technical demonstration, the talk will also open a broader reflection on the challenges and opportunities brought by generative AI in academic research—especially in terms of reliability, transparency, and methodological adaptation for qualitative and quantitative inquiries.

The project repository: https://github.com/emilienschultz/activetigger/

The development of this software is funded by the DRARI Ile-de-France and supported by Progédo.

Gaston Berger
15:15
30min
Code as Data: A Practical Introduction to Python’s Abstract Syntax Tree
Laurent Direr

Peek under the hood of Python and unlock the power of its Abstract Syntax Tree! We'll demystify the AST and explore how it powers tools like pytest, linters, and refactoring - as well as some of your favorite data libraries.
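A small taste of the stdlib `ast` module the talk is built around (illustrative snippet, not from the talk):

```python
import ast

source = """
def add(a, b):
    return a + b

def mul(a, b):
    return a * b
"""

tree = ast.parse(source)

# Walk the tree and collect every function definition: the same kind of
# traversal that linters and pytest's assertion rewriting are built on.
names = [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
print(names)  # ['add', 'mul']

# The tree is data you can edit and turn back into source code.
tree.body[0].name = "plus"
print(ast.unparse(tree.body[0]).splitlines()[0])  # def plus(a, b):
```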

Louis Armand 1 - Est
15:15
30min
You Don’t Need Spark for That: Pythonic Data Lakehouse Workflows
Romain Clement

Have you ever spun up a Spark cluster just to update three rows in a Delta table? In this talk, we’ll explore how modern Python libraries can power lightweight, production-grade Data Lakehouse workflows—helping you avoid over-engineering your data stack.

Louis Armand 2 - Ouest
15:45
15min
Break
Gaston Berger
15:45
15min
Break
Louis Armand 1 - Est
15:45
15min
Break
Louis Armand 2 - Ouest
16:00
45min
Big ideas shaping scientific Python: the quest for performance and usability
Ralf Gommers

Behind every technical leap in scientific Python lies a human ecosystem of volunteers, companies, and institutions working in tension and collaboration. This keynote explores how innovation actually happens in open source, through the lens of recent and ongoing initiatives that aim to move the needle on performance and usability - from the ideas that went into NumPy 2.0 and its relatively smooth rollout to the ongoing efforts to leverage the performance GPUs offer without sacrificing maintainability and usability.

Takeaways for the audience: Whether you’re an ML engineer tired of debugging GPU-CPU inconsistencies, a researcher pushing Python to its limits, or an open-source maintainer seeking sustainable funding, this keynote will equip you with both practical solutions and a clear vision of where scientific Python is headed next.

Gaston Berger
16:45
60min
Lightning Talks
Gaston Berger
08:00
60min
Registration & Welcome Coffee
Gaston Berger
09:00
15min
Forewords
Gaston Berger
09:15
45min
Building Data Science Tools for Sustainable Transformation
Anita Graser

The current AI hype, driven by generative AI and particularly large language models, is creating excitement, fear, and inflated expectations. In this keynote, we'll explore geographic & mobility data science tools (such as GeoPandas and MovingPandas) to transform this hype into sustainable and positive development that empowers users.

Gaston Berger
10:00
5min
Room change
Gaston Berger
10:05
30min
Balancing Privacy and Utility: Efficient PII Detection and Replacement in Textual Data
Elizaveta Clouet, Justine BEL-LETOILE

Anonymizing free-text data is harder than it seems. While structured databases have well-established anonymization techniques, textual data — like invoices, resumes, or medical records — poses unique challenges. Personally identifiable information (PII) can appear anywhere, in unpredictable formats, raising the question: how do you modify it while preserving the dataset's usefulness?

Let's explore a practical, open-source 2-step approach to text anonymization: (1) detecting PII using NER models and (2) replacing it while preserving key dataset characteristics (e.g. document formatting, statistical distributions). We will demonstrate how to build a robust pipeline leveraging tools such as pre-trained PII detection models, gliner for fine-tuning, or Faker for generating meaningful replacements.
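A deliberately tiny stdlib sketch of the two-step idea (the talk uses NER models and Faker; here a regex stands in for detection of one easy PII class, and a format-preserving fake address stands in for replacement):

```python
import re

# Toy stand-in for step (1): detect one easy PII class (email addresses).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def anonymize(text: str) -> str:
    # Step (2): replace while preserving the surface format, so that
    # downstream parsing of the document does not break.
    counter = iter(range(1, 10**6))
    return EMAIL.sub(lambda m: f"user{next(counter)}@example.org", text)

doc = "Contact alice@corp.fr or bob.smith@lab.io for the invoice."
print(anonymize(doc))
# Contact user1@example.org or user2@example.org for the invoice.
```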

Ideal for those with a basic understanding of NLP, this session offers practical insights for anyone working with sensitive textual data.

Gaston Berger
10:05
30min
Sharing computational course material at larger scale: a French multi-tenant attempt
Nicolas M. Thiéry

With the rise of computation and data as pillars of science, institutions are struggling to provide large-scale training to their students and staff. Often, this leads to redundant, fragmented efforts, with each organization producing its own bespoke training material. In this talk, we report on a collaborative multi-tenant initiative to produce a shared corpus of interactive training resources in the Python language, designed as a digital common that can be adapted to diverse contexts and formats in French higher education and beyond.

Louis Armand 2 - Ouest
10:05
30min
Skrub: machine learning for dataframes
Riccardo Cappuzzo, Jérôme Dockès, Guillaume Lemaitre

Machine-learning algorithms expect a numeric array with one row per observation. Typically, creating this table requires "wrangling" with Pandas or Polars (aggregations, selections, joins, ...), and to extract numeric features from structured data types such as datetimes. These transformations must be applied consistently when making predictions for unseen inputs, and choices must be informed by performance measured on a validation dataset, while preventing data leakage. This preprocessing is the most difficult and time-consuming part of many data-science projects.

Skrub bridges the gap between complex tabular data stored in Pandas or Polars dataframes, and machine-learning algorithms implemented by scikit-learn estimators. It provides scikit-learn transformers to extract features from datetimes, (fuzzy) categories and text, and to perform data-wrangling such as joins and aggregations in a learning pipeline. Its pre-built, flexible learners offer very robust performance on many tabular datasets without manual tweaking. It can create complex pipelines that handle multiple tables, while easily describing and searching rich hyperparameter spaces. As interactivity and visualization are essential for preprocessing, Skrub also provides an interactive report to explore a dataframe, and its pipelines can be built incrementally while inspecting intermediate results.

Louis Armand 1 - Est
10:35
15min
Break
Gaston Berger
10:35
15min
Break
Louis Armand 1 - Est
10:35
15min
Break
Louis Armand 2 - Ouest
10:50
30min
Meta-Dashboards: Accelerating Geospatial Web Apps Creation with Voilà
Davide De Marchi

The Joint Research Centre has cultivated significant expertise in developing Voilà dashboards for scientific data visualization, resulting in the design and deployment of many real-world web applications. This presentation will highlight our commitment to building a robust Voilà developer community through dedicated training and resource libraries. We will introduce and demonstrate our innovative meta-dashboards, which streamline the creation of complex, multi-page dashboards by automating framework and code generation. A live demonstration will illustrate the ease of building a geospatial application using this tool. We will conclude with a showcase of recently developed Voilà dashboards in areas such as agricultural/biodiversity surveys and air quality monitoring, demonstrating their effectiveness in data exploration and validation.

Louis Armand 2 - Ouest
10:50
30min
Move beyond academia: Introducing an industry-first tabular benchmark
Alexandre Abraham

Discover a new benchmark designed for real-world impact. Built on authentic private-company data and carefully chosen public datasets that reflect real industry challenges, like product categorization, basket prediction, and personalized recommendations, it offers a realistic testing ground for both classic baselines (e.g., gradient boosting) and the latest models such as CARTE, TabICL, and TabPFN. By bridging the gap between academic research and industrial needs, this benchmark brings model evaluation closer to the decisions and constraints faced in practice.

This shift has tangible consequences: models are tested on problems that matter to businesses, using metrics that reflect real-world priorities (e.g., Precision@K, Recall@K, MAP@K). It enables more relevant model selection, highlights where academic approaches fall short, and fosters solutions that are not just novel but deployable. Models are judged on tasks and metrics that matter, enabling more informed choices, exposing the limits of lab-only approaches, and helping accelerate the journey from innovation to deployment.

Gaston Berger
10:50
30min
PyPI in the face: running jokes that PyPI download stats can play on you
Loïc Estève

We all love to tell stories with data and we all love to listen to them. Wouldn't it be great if we could also draw actionable insights from these nice stories?

As scikit-learn maintainers, we would love to use PyPI download stats and other proxy metrics (website analytics, GitHub repository statistics, etc.) to help inform some of our decisions, like:
- how do we increase user awareness of best practices (please use Pipeline and cross-validation)?
- how do we advertise our recent improvements (use HistGradientBoosting rather than GradientBoosting, TunedThresholdClassifier, PCA and a few other models can run on GPU)?
- do users care more about new features from recent releases or consolidation of what already exists?
- how long should we support older versions of Python, numpy or scipy?

In this talk we will highlight a number of lessons learned while trying to understand the complex reality behind these seemingly simple metrics.

Telling nice stories is not always hard; trying to grasp the reality behind these metrics is often tricky.

Louis Armand 1 - Est
11:25
30min
Beyond Prototyping: Building Production-Level Apps with Streamlit
Johannes Rieke

Streamlit is a great tool for prototyping data apps, but is it also fit for complex, production-level apps? In this talk, the Streamlit team will showcase new features, LLM integrations, and deployment options that can help you effectively use Streamlit in your company, whether it’s a small startup or a large enterprise.

Louis Armand 2 - Ouest
11:25
30min
CodeCommons: Towards transparent, richer and sustainable datasets for code generation model training
Simeon Carstens

Built on top of Software Heritage - the largest public archive of source code - the CodeCommons collaboration is building a large-scale, metadata-rich source code dataset designed to make training AI models on code more transparent, sustainable, and fair. Code will be enriched with contextual information such as issues, pull request discussions, licensing data, and provenance.
In this presentation, we will outline the goals and structure of both the Software Heritage and CodeCommons projects, and discuss our particular contribution to CodeCommons' big data infrastructure.

Louis Armand 1 - Est
11:25
30min
Enhancing Machine Learning Workflows with skore
Marie Sacksick

Discover how skore, a new-born open-source Python library, can elevate your machine learning projects by integrating recommended practices and avoiding common pitfalls. This talk will introduce skore's key features and demonstrate how it can streamline your model evaluation and diagnostics processes.

Gaston Berger
12:00
30min
A Journey Through a Geospatial Data Pipeline: From Raw Coordinates to Actionable Insights
Florent Gravin

Every dataset has a story — and when it comes to geospatial data, it’s a story deeply rooted in space and scale. But working with geospatial information is often a hidden challenge: massive file sizes, strange formats, projections, and pipelines that don't scale easily.

In this talk, we'll follow the life of a real-world geospatial dataset, from its raw collection in the field to its transformation into meaningful insights. Along the way, we’ll uncover the key steps of building a robust, scalable open-source geospatial pipeline.

Drawing on years of experience at Camptocamp, we’ll explore:

  • How raw spatial data is ingested and cleaned
  • How vector and raster data are efficiently stored and indexed (PostGIS, Cloud Optimized GeoTIFFs, Zarr)
  • How modern tools like Dask, GeoServer, and STAC (SpatioTemporal Asset Catalogs) help process and serve geospatial data
  • How to design pipelines that handle both "small data" (local shapefiles) and "big data" (terabytes of satellite imagery)
  • Common pitfalls and how to avoid them when moving from prototypes to production

This journey will show how the open-source ecosystem has matured to make geospatial big data accessible — and how spatial thinking can enrich almost any data project, whether you are building dashboards, doing analytics, or setting the stage for machine learning later on.

Louis Armand 2 - Ouest
12:00
30min
How to do real TDD in data science? A journey from pandas to polars with pelage!
Alix Tiran-Cappello

In the world of data, inconsistencies and inaccuracies often present a major challenge to extracting valuable insights. Yet the number of robust tools and practices to address those issues remains limited. In particular, the practice of TDD remains quite difficult in data science, while it is a standard in classic software development, partly because of poorly adapted tools and frameworks.

To address this issue we released Pelage, an open-source Python package to facilitate data exploration and testing, which relies on Polars' intuitive syntax and speed. Pelage helps data scientists and analysts streamline data transformations, enhance data quality, and improve code clarity.

We will demonstrate, in a test-first approach, how you can use this library in a meaningful data science workflow to gain greater confidence for your data transformations.

See website: https://alixtc.github.io/pelage/

Louis Armand 1 - Est
12:00
30min
Probabilistic regression models: let's compare different modeling strategies and discuss how to evaluate them
Olivier Grisel

Most common machine learning models (linear, tree-based or neural network-based) optimize the least-squares loss when trained for regression tasks. As a result, they output a point estimate of the conditional expected value of the target: E[y|X].

In this presentation, we will explore several ways to train and evaluate probabilistic regression models as a richer alternative to point estimates. Those models predict a richer description of the full distribution of y|X and allow us to quantify the predictive uncertainty for individual predictions.

On the model training part, we will introduce the following options:

  • an ensemble of quantile regressors for a grid of quantile levels (using linear models or gradient boosted trees in scikit-learn, XGBoost and PyTorch);
  • reducing probabilistic regression to multi-class classification, plus a cumulative sum of the predict_proba output to recover a continuous conditional CDF;
  • implementing this approach as a generic scikit-learn meta-estimator;
  • how this approach is used to pretrain foundational tabular models (e.g. TabPFNv2);
  • simple Bayesian models (e.g. Bayesian Ridge and Gaussian Processes);
  • more specialized approaches as implemented in XGBoostLSS.
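The classification-reduction idea listed above fits in a few lines of NumPy (illustrative numbers, not from the talk): given predicted class probabilities over ordered target bins, a cumulative sum yields a step-wise conditional CDF, which can then be inverted for quantiles.

```python
import numpy as np

# Hypothetical predict_proba output for one sample, where the continuous
# target was discretized into 5 ordered bins.
bin_edges = np.array([0.0, 10.0, 20.0, 30.0, 40.0, 50.0])
proba = np.array([0.1, 0.2, 0.4, 0.2, 0.1])  # sums to 1

# Cumulative sum over ordered bins recovers a step-wise conditional CDF.
cdf = np.cumsum(proba)
print(cdf)  # [0.1 0.3 0.7 0.9 1. ]

# Quantiles follow by inverting the CDF, e.g. the bin holding the median:
median_bin = np.searchsorted(cdf, 0.5)
print(bin_edges[median_bin], bin_edges[median_bin + 1])  # 20.0 30.0
```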

We will also discuss how to evaluate probabilistic predictions via:

  • the pinball loss of quantile regressors,
  • other strictly proper scoring rules such as Continuous Ranked Probability Score (CRPS),
  • coverage measures and width of prediction intervals,
  • reliability diagrams for different quantile levels.
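As a reference point, the pinball loss for a single quantile level fits in a few lines of plain Python (an illustrative sketch; scikit-learn ships an equivalent mean_pinball_loss metric):

```python
def pinball_loss(y_true, y_pred, quantile):
    """Average pinball (quantile) loss at one quantile level.

    Per observation: quantile * (y - q) if y >= q, else
    (quantile - 1) * (y - q). Lower is better; in expectation the
    loss is minimized by the true conditional quantile, which is
    what makes it a strictly proper scoring rule for quantiles.
    """
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += quantile * diff if diff >= 0 else (quantile - 1) * diff
    return total / len(y_true)


# At quantile=0.5 the pinball loss is half the mean absolute error:
print(pinball_loss([1.0, 3.0], [2.0, 2.0], 0.5))  # 0.5
```

The asymmetry is the point: at quantile 0.9, under-predictions are penalized nine times more than over-predictions, pushing the regressor toward the 90th percentile.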

We will illustrate these concepts with concrete examples and running code.

Finally, we will illustrate why some applications need such calibrated probabilistic predictions:

  • estimating uncertainty in trip times depending on traffic conditions, to help a human decision-maker choose among travel plan options;
  • modeling value at risk for investment decisions;
  • assessing the impact of missing variables for an ML model trained to work in degraded mode;
  • Bayesian optimization of operational parameters of industrial machines from few, costly observations.

If time allows, we will also discuss the usage and limitations of Conformal Quantile Regressors as implemented in MAPIE, and contrast the aleatoric vs epistemic uncertainty captured by those models.

Gaston Berger
12:30
12:30
95min
Lunch
Gaston Berger
12:30
95min
Lunch
Louis Armand 1 - Est
12:30
95min
Lunch
Louis Armand 2 - Ouest
14:05
14:05
30min
Applying Causal Inference in Industry 4.0: A Case Study from Glasswool Production
Simona Bottani, Patrick Lee

Causal inference offers a principled way to estimate the effects of interventions—a critical need in industrial settings where decisions directly impact costs and performance. This talk presents a case study from Saint-Gobain, in collaboration with Inria, where we applied causal inference methods to production and quality data to reduce raw material usage without compromising product quality.
We’ll walk through each step of a causal analysis: building a causal graph in collaboration with domain experts, identifying confounders, working with continuous treatments, and using open-source tools such as DoWhy, EconML, and DAGitty. The talk is aimed at data scientists with basic ML experience, looking to apply causal thinking to real-world, non-academic problems.

Louis Armand 2 - Ouest
14:05
30min
Documents Meet LLMs: Tales from the Trenches
Nour El Mawass, Miklos Erdelyi

Processing documents with LLMs comes with unexpected challenges: handling long inputs, enforcing structured outputs, catching hallucinations, and recovering from partial failures.
In this talk, we'll cover why large context windows are not a silver bullet, why chunking is deceptively hard, and how to design inputs and outputs that allow for intelligent retries. We'll also share practical prompting strategies, discuss OCR and parsing tools, compare different LLMs (and their cloud APIs), and highlight real-world insights from our experience developing production GenAI applications across multiple document-processing scenarios.

Gaston Berger
14:05
30min
Machine Learning in the Browser: Fast Iteration with ONNX & WebAssembly
Romain Clement

Deploying ML models doesn’t have to mean spinning up servers and writing backend code. This talk shows how to run machine learning inference directly in the browser—using ONNX and WebAssembly—to go from prototype to interactive demo in minutes, not weeks.

Louis Armand 1 - Est
14:40
14:40
30min
Advancements in optimizing ML Inference at CERN
Sanjiban Sengupta

At CERN—the European Organization for Nuclear Research—machine learning is applied across a wide range of scenarios, from simulations and event reconstruction to classifying interesting experimental events, all while handling data rates in the order of terabytes per second. As a result, beyond developing complex models, CERN also requires highly optimized mechanisms for model inference.

In the ML4EP team at CERN, we have developed SOFIE (System for Optimized Fast Inference code Emit), an open-source tool designed for fast inference on ML models with minimal dependencies and low latency. SOFIE is under active development, driven by feedback not only from high-energy physics researchers but also from the broader scientific community.

With upcoming upgrades to CERN’s experiments expected to increase data generation, we have been investigating optimization methods to make SOFIE even more efficient in terms of time and memory usage, while improving its accessibility and ease of integration with other software stacks.

In this talk, we will introduce SOFIE and present novel optimization strategies developed to accelerate ML inference and reduce resource overhead.

Louis Armand 2 - Ouest
14:40
30min
Build a data studio in your notebook with jupyter-fs
Tim Paine

jupyter-fs provides an interface between PyFilesystem and fsspec file systems, the JupyterLab user interface, and the Jupyter notebooks you run. Connect and browse your local filesystem, S3, Samba, WebDAV, and more, interacting with data seamlessly from both the JupyterLab UI and your notebook's kernel.

Louis Armand 1 - Est
14:40
30min
Repetita Non Iuvant: Why Generative AI Models Cannot Feed Themselves
Valeria Zuccoli

As AI floods the digital landscape with content, what happens when it starts repeating itself?
This talk explores model collapse, a progressive erosion where LLMs and image generators loop on their own results, hindering the creation of novel output.

We will show how self-training leads to bias and loss of diversity, examine the causes of this degradation, and quantify its impact on model creativity.
Finally, we will also present concrete strategies to safeguard the future of generative AI, emphasizing the critical need to preserve innovation and originality.

By the end of this talk, attendees will gain insights into the practical implications of model collapse, understanding its impact on content diversity and the long-term viability of AI.

Gaston Berger
15:15
15:15
30min
Building Resilient (ML) Pipelines for MLOps
Lex Avstreikh

This talk explores the disconnect between MLOps fundamental principles and their practical application in designing, operating and maintaining machine learning pipelines. We’ll break down these principles, examine their influence on pipeline architecture, and conclude with a straightforward, vendor-agnostic mind-map, offering a roadmap to build resilient MLOps systems for any project or technology stack.
Despite the surge in tools and platforms, many teams still struggle with the same underlying issues: brittle data dependencies, poor observability, unclear ownership, and pipelines that silently break once deployed. Architecture alone isn't the answer — systems thinking is.

We'll use concrete examples to walk through common failure modes in ML pipelines, highlight where analogies fall apart, and show how to build systems that tolerate failure, adapt to change, and support iteration without regressions.

Topics covered include:
- Common failure modes in ML pipelines
- Modular design: feature, training, inference
- Built-in observability, versioning, reuse
- Orchestration across batch, real-time, LLMs
- Platform-agnostic patterns that scale

Key takeaways:
- Resilience > diagrams
- Separate concerns, embrace change
- Metadata is your backbone
- Infra should support iteration, not block it

Louis Armand 2 - Ouest
15:15
30min
From Language to Knowledge: How SpaCy Can Build Better AI Models
Anushka Narula

Natural language processing (NLP) models are great at recognizing patterns, but they often fail to understand context and meaning. In this talk, I’ll show how to combine SpaCy’s NLP capabilities with knowledge-based AI (KBAI) to build smarter, context-aware models that improve accuracy and reasoning.

Gaston Berger
15:15
30min
xeus-cpp, the new C++ kernel for Jupyter.
Johan Mabille, Anutosh Bhat

xeus-cpp is the next-generation Jupyter kernel for C++, replacing the outdated xeus-cling. It supports recent versions of the language, comes with new features, can be extended, and even provides a JupyterLite kernel.

Louis Armand 1 - Est
15:45
15:45
15min
Break
Gaston Berger
15:45
15min
Break
Louis Armand 1 - Est
15:45
15min
Break
Louis Armand 2 - Ouest
16:00
16:00
30min
Architecting Scalable Multi-Modal Video Search
Irene Donato

The exponential growth of video data presents significant challenges for effective content discovery. Traditional keyword search falls short when dealing with visual nuances. This talk addresses the design and implementation of a robust system for large-scale, multi-modal video retrieval, enabling search across petabytes of data using diverse inputs like text descriptions (e.g., appearance, actions) and query images (e.g., faces). We will explore an architecture combining efficient batch preprocessing for feature extraction (including person detection, face/CLIP-style embeddings) with optimized vector database indexing. Attendees will learn about strategies for managing massive datasets, optimizing ML inference pipelines for speed and cost-efficiency (touching upon lightweight models and specialized runtimes), and building interactive systems that bridge pre-computed indexes with real-time analysis capabilities for enhanced insights.

Gaston Berger
16:00
30min
CoSApp: an open-source library to design complex systems
Étienne Lac

CoSApp, for Collaborative System Approach, is a Python library dedicated to the simulation and design of multi-disciplinary systems. It is primarily intended for engineers and system architects during the early stage of industrial product design. The API of CoSApp is focused on simplicity and explicit declaration of design problems. Special attention is given to modularity; a very flexible mechanism of solver assembly allows users to construct complex, customized simulation workflows.
This presentation introduces the key features of the framework.

https://cosapp.readthedocs.io
https://gitlab.com/cosapp/cosapp

Louis Armand 2 - Ouest
16:00
30min
Parallel processing using CRDTs
David Brochart

Beyond embarrassingly parallel processing problems, data must be shared between workers for them to do something useful. This can be done by:
- sharing memory between threads, with the issue of preventing access to shared data to avoid race conditions.
- copying memory to subprocesses, with the challenge of synchronizing data whenever it is mutated.

In Python, using threads is not an option because of the GIL (global interpreter lock), which prevents true parallelism. This might change in the future with the removal of the GIL, but the usual problems of multithreading will then appear, such as using locks and managing their complexity. Subprocesses don't suffer from the GIL, but usually need to access a database to share data, which is often too slow. Data structures such as HAMTs (hash array mapped tries) have been used to efficiently and safely share data stored in immutable structures, removing the need for locks. In this talk we will show how CRDTs (conflict-free replicated data types) can be used for the same purpose.
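To make the idea concrete (this example is ours, not taken from the talk), the simplest state-based CRDT is a grow-only counter: each worker only ever increments its own slot, so replicas can always be merged without locks.

```python
class GCounter:
    """Grow-only counter: each worker increments only its own slot,
    so concurrent replicas merge with an element-wise max."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.counts = {}  # worker_id -> that worker's increment count

    def increment(self, n=1):
        self.counts[self.worker_id] = self.counts.get(self.worker_id, 0) + n

    def merge(self, other):
        # Merge is commutative, associative and idempotent, so replicas
        # converge regardless of the order in which updates arrive.
        for wid, c in other.counts.items():
            self.counts[wid] = max(self.counts.get(wid, 0), c)

    def value(self):
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(3)  # workers update their replicas concurrently...
b.increment(2)
a.merge(b)      # ...then exchange state in any order
b.merge(a)
print(a.value(), b.value())  # 5 5
```

Richer CRDTs (maps, lists, text) follow the same contract: a merge that is commutative, associative and idempotent, which is exactly what removes the need for coordination between workers.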

Louis Armand 1 - Est
16:30
16:30
60min
Lightning Talks
Gaston Berger
17:30
17:30
15min
Farewell
Gaston Berger
09:30
09:30
450min
Project Sprints
Carrefour Numérique