SciPy 2026

Discovering Particles: How we analyze petabytes of particle collision data using Python
2026-07-15, University Hall

At CERN's Large Hadron Collider, we collide protons at near light-speed to discover new particles and understand fundamental physics. Python is becoming the primary language for analyzing this data, marking a significant evolution from the Fortran and C++ workflows of previous decades.

This talk explores the modern Python-based analysis pipeline of High-Energy Physics (HEP) and the technical challenges it addresses. We'll present how we handle nested, jagged data structures and work with data at the petabyte scale using Scikit-HEP, a community-driven ecosystem of specialized tools for high-performance data analysis.

We'll show how we're building a Python stack that integrates with distributed computing frameworks and leverages GPU acceleration. Beyond domain-specific analysis tools, HEP's transition to Python has driven improvements to the broader Python packaging ecosystem, including contributions to cibuildwheel, the development of scikit-build-core, and advances in pybind11, benefiting anyone building Python packages with compiled extensions.


This talk takes you inside the data analysis pipeline at CERN's Large Hadron Collider, where physicists are transitioning from decades of Fortran and C++ workflows to Python-based analysis. We'll explore the technical challenges of working with petabyte-scale, nested data, and show how the solutions developed for High-Energy Physics (HEP) have become valuable tools for the broader Python community.

We'll begin by explaining why HEP computing evolved the way it did. Fortran dominated for decades, then C++ and the ROOT framework became standard in the 1990s. We'll explain what triggered the recent shift toward Python: the maturation of NumPy and the scientific stack, the need for faster iteration, and the desire to make analysis more accessible. This history explains the design constraints and opportunities that shaped today's tools.

At the heart of modern HEP analysis is Scikit-HEP, a community-driven collection of Python packages. We'll dive into the key components: uproot enables pure-Python access to ROOT files without C++ dependencies, Awkward Array provides NumPy-like operations on jagged data structures, hist delivers high-performance histogramming, and additional libraries handle vector math and statistical fitting. Through code examples, we'll demonstrate how these pieces fit together in an actual analysis workflow.

One of the most interesting technical problems is the structure of collision data itself. When protons collide, each event produces a different number of particles, each with multiple properties. Traditional rectilinear arrays can't represent this naturally. You need nested, variable-length arrays. This isn't just a physics problem; it's the same challenge you face with nested JSON-like data. We'll show how Awkward Array's approach to jagged data offers an elegant solution that's applicable far beyond physics.
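To make the jagged-data problem concrete, here is a minimal pure-Python sketch of one common way to store variable-length lists in a columnar form: a flat `content` buffer plus an `offsets` list marking where each event starts and stops. It is an illustration of the idea only, not Awkward Array's API or implementation; the helper names (`to_offsets`, `counts`, `sum_per_event`, `select`) and the transverse-momentum values are made up for this example.

```python
# Three hypothetical collision events with 2, 0, and 3 particles each.
# The numbers stand in for a per-particle property such as pT in GeV.
events = [[41.2, 17.5], [], [63.0, 28.4, 9.1]]


def to_offsets(nested):
    """Flatten a list of variable-length lists into (content, offsets)."""
    content, offsets = [], [0]
    for sublist in nested:
        content.extend(sublist)
        offsets.append(len(content))
    return content, offsets


def counts(offsets):
    """Particles per event, computed from adjacent offset differences."""
    return [stop - start for start, stop in zip(offsets, offsets[1:])]


def sum_per_event(content, offsets):
    """Reduce over the inner (per-particle) axis, one result per event."""
    return [sum(content[start:stop]) for start, stop in zip(offsets, offsets[1:])]


def select(content, offsets, keep):
    """Apply a per-particle cut while preserving event boundaries."""
    new_content, new_offsets = [], [0]
    for start, stop in zip(offsets, offsets[1:]):
        new_content.extend(x for x in content[start:stop] if keep(x))
        new_offsets.append(len(new_content))
    return new_content, new_offsets


content, offsets = to_offsets(events)
n_particles = counts(offsets)
total_pt = sum_per_event(content, offsets)

# Keep only particles above a hypothetical 20 GeV threshold.
cut_content, cut_offsets = select(content, offsets, lambda pt: pt > 20.0)
```

The empty second event costs nothing in `content` and is still represented in `offsets`, which is what a rectangular array with padding cannot do cleanly. The same layout applies to any list-of-lists data, such as arrays nested inside JSON records.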

Scale presents another major challenge. The High-Luminosity LHC upgrade will require analyzing petabytes in under an hour. We'll present our approach: leveraging distributed computing systems (like Dask) across clusters, using GPU acceleration where it provides the most benefit, and designing analysis facilities that colocate computation with data storage. These patterns are relevant to anyone tackling large-scale data problems.
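One reason this kind of analysis distributes well is that histograms are additive: each worker can fill a histogram from its own chunk of data, and the partial results can be summed afterwards. The sketch below shows that map-and-merge pattern with the standard library's `ThreadPoolExecutor` standing in for a cluster scheduler. It does not use Dask, and the bin edges, chunk contents, and function names are assumptions made for illustration.

```python
import bisect
import functools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical bin edges (for example, pT in GeV); bins are half-open [low, high).
BIN_EDGES = [0.0, 20.0, 40.0, 60.0, 80.0]


def fill_histogram(values, edges=BIN_EDGES):
    """Count values per bin; values outside the edges are ignored."""
    bin_counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect.bisect_right(edges, v) - 1
        if 0 <= i < len(bin_counts):
            bin_counts[i] += 1
    return bin_counts


def merge(a, b):
    """Combine two partial histograms that share the same binning."""
    return [x + y for x, y in zip(a, b)]


# Each chunk stands in for one file or partition of a much larger dataset.
chunks = [
    [41.2, 17.5, 63.0],
    [28.4, 9.1],
    [75.3, 22.0, 95.0],
]

# Map: fill one partial histogram per chunk, independently of the others.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(fill_histogram, chunks))

# Reduce: sum the partial histograms bin by bin.
total = functools.reduce(merge, partials)
```

Because `merge` is associative and commutative, the partial histograms can be combined in any order or in a tree, so the reduction itself can also be spread across workers.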

HEP's relatively late adoption of Python created an interesting dynamic: we needed production-quality infrastructure for building binary extensions but didn't have legacy tools to maintain. This drove significant contributions to the Python packaging ecosystem. We needed reliable cross-platform wheel building for packages like boost-histogram, awkward, and iminuit, which led to major improvements in cibuildwheel. We needed better build systems for C++ extensions, which resulted in scikit-build-core. We pushed forward pybind11 development and originally created the Scientific Python development guide and cookie template. These infrastructure improvements now benefit anyone distributing Python packages with compiled code.

The broader theme is how domain-specific needs can drive general-purpose innovation. The tools and infrastructure HEP has developed address problems common across scientific computing and data engineering.

I'm a PhD student in the Department of Physics and Astronomy at Rice University, conducting research in high-energy physics as a member of the CMS experiment at the Large Hadron Collider at CERN. My work focuses on studying Higgs boson decays into two photons, analyzing data collected by the CMS detector, and contributing to software development for large-scale scientific analyses. I'm passionate about scientific computing and open-source tools that enable reproducible and efficient research. I'm a maintainer of Awkward Array, an array library for nested, variable-sized data that uses NumPy-like idioms, and an author and maintainer of Coffea, a toolkit designed to simplify data analysis in particle physics. With deep experience in the scientific Python ecosystem, I enjoy building tools that drive insight and accelerate scientific discovery.
