Advancements in optimizing ML Inference at CERN
2025-10-01, Louis Armand 2 - Ouest

At CERN—the European Organization for Nuclear Research—machine learning is applied across a wide range of scenarios, from simulations and event reconstruction to classifying interesting experimental events, all while handling data rates in the order of terabytes per second. As a result, beyond developing complex models, CERN also requires highly optimized mechanisms for model inference.

In the ML4EP team at CERN, we have developed SOFIE (System for Optimized Fast Inference code Emit), an open-source tool designed for fast inference of ML models with minimal dependencies and low latency. SOFIE is under active development, driven by feedback not only from high-energy physics researchers but also from the broader scientific community.

With upcoming upgrades to CERN’s experiments expected to increase data generation, we have been investigating optimization methods to make SOFIE even more efficient in terms of time and memory usage, while improving its accessibility and ease of integration with other software stacks.

In this talk, we will introduce SOFIE and present novel optimization strategies developed to accelerate ML inference and reduce resource overhead.


Experiments at CERN generate a tremendous amount of data every second, which is then analyzed in search of new discoveries and physical processes. From simulating physical phenomena to reconstructing the tracks of fundamental particles using GenAI architectures like diffusion models, and from applying anomaly detection to uncover unknown physics to identifying interesting events for further study, machine learning plays a key role in all such applications. While sophisticated models exist for these scenarios, running inference on them efficiently remains challenging. Popular inference frameworks are often not well-suited to the high data influx rates at CERN, since they lack flexibility and do not offer fine-grained control.

To address these challenges, SOFIE (System for Optimized Fast Inference code Emit) was developed: a tool that translates trained ML models into highly optimized C++ code, generating functions that can be easily integrated and called for inference. SOFIE converts models in ONNX format into its own intermediate representation, and also supports models trained in Keras, PyTorch, and message-passing GNNs from DeepMind's Graph Nets library. With the core functionality developed in C++, SOFIE comes with Python interfaces for easier access.
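As a minimal sketch, the snippet below shows what this workflow can look like through ROOT's Python bindings (PyROOT), following the pattern of ROOT's SOFIE tutorials. It assumes a ROOT build with TMVA/SOFIE enabled; the file names "model.onnx" and "model.hxx" are placeholders.

    # Sketch: ONNX model -> SOFIE intermediate representation -> generated C++.
    # Assumes ROOT built with TMVA/SOFIE; file names are placeholders.
    import ROOT

    SOFIE = ROOT.TMVA.Experimental.SOFIE

    # Parse the ONNX file into SOFIE's intermediate representation (RModel).
    model = SOFIE.RModelParser_ONNX().Parse("model.onnx")

    # Emit a standalone C++ header containing the inference function,
    # with the trained weights stored alongside it.
    model.Generate()
    model.OutputGenerated("model.hxx")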

The key advantage of SOFIE is its ability to generate standalone C++ code that can be invoked directly within applications with low latency and minimal dependencies—requiring only BLAS for numerical computations. This enables seamless integration into high-energy physics workflows and other computationally demanding environments. Additionally, the generated code can be compiled at runtime using Cling’s Just-In-Time (JIT) compiler, allowing flexible execution, including within Python environments. By eliminating the need for heavyweight ML frameworks during inference, SOFIE offers a highly efficient and easily deployable solution.
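For illustration, calling the generated code from Python could look like the sketch below, with Cling JIT-compiling the header inside the running process. The names used here are assumptions: the generated namespace is derived from the model file (assumed TMVA_SOFIE_model for "model.hxx"), and the input is assumed to be ten floats.

    # Sketch: JIT-compile the generated header with Cling and run inference.
    # "model.hxx", TMVA_SOFIE_model, and the input size are assumptions.
    import ROOT

    ROOT.gInterpreter.Declare('#include "model.hxx"')  # compiled at runtime by Cling

    session = ROOT.TMVA_SOFIE_model.Session()          # loads the stored weights
    x = ROOT.std.vector["float"](10, 1.0)              # example input: ten ones
    output = session.infer(x.data())                   # returns the output tensor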

Recent benchmarking results indicate that SOFIE outperforms ONNX Runtime for small-scale models in single-event evaluation. However, further optimizations are required to handle more complex models efficiently, especially in light of upcoming CERN upgrades expected to increase the data influx rate by 100x.

In this talk, we will explore SOFIE—CERN’s in-house tool for fast ML inference—its architecture, and the new optimization mechanisms developed to further reduce latency and minimize data movement. Although developed at CERN, SOFIE is applicable to other high-throughput environments such as autonomous vehicles, space exploration, industrial automation, fraud detection and beyond.

Outline

  • ML at CERN
    • Areas of ML applications at CERN
    • Difficulties and new implementations
  • Introducing SOFIE
    • Motivation
    • Why does CERN need super-fast ML inference with low latency and minimal dependencies?
    • Why aren't frameworks like TensorFlow or PyTorch a good fit for ML inference at CERN?
  • SOFIE Architecture
    • Parser
    • Model Storage
    • Inference Code Generator
  • SOFIE in Action: Current support
  • SOFIE Optimization Methods
    • Space Optimization
      • Memory Planning
        • Memory reuse with the "Partition first, Merge later" algorithm (a generic sketch of lifetime-based reuse follows this outline)
        • Memory Allocators
        • Allocation/Deallocation through SoA (Struct of Arrays)
        • Custom user memory handlers
        • Extending from CPU to GPU memory
      • Operator Fusion and Elimination
      • Dynamic Computation Support
      • Kernel-level optimization
    • Time Optimization
      • Data caching avoiding copies
      • Matrix Calculations
      • Sparse data handling
      • Activation optimization
      • Kernel-level optimization
  • Benchmarking results
  • Future Goals
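The "Partition first, Merge later" item above concerns reusing the memory of intermediate tensors once they are no longer needed. As a generic illustration of the underlying idea, and not SOFIE's actual implementation, the sketch below assigns offsets in a single arena from tensor lifetimes, reusing freed regions with a first-fit policy:

    # Generic sketch of lifetime-based buffer reuse (illustrative only;
    # not SOFIE's actual "Partition first, Merge later" algorithm).
    def plan_memory(tensors):
        """tensors: list of (name, size, first_use, last_use) in execution order.
        Returns ({name: offset}, arena_size) for one shared buffer."""
        live = []                 # (last_use, offset, size) of tensors still alive
        free = []                 # (offset, size) regions whose tensor has died
        offsets, arena = {}, 0
        for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
            # Retire tensors whose last use precedes this tensor's creation.
            for entry in [e for e in live if e[0] < first]:
                live.remove(entry)
                free.append((entry[1], entry[2]))
            # First fit: reuse a freed region if it is large enough ...
            for i, (off, sz) in enumerate(free):
                if sz >= size:
                    free[i] = (off + size, sz - size)  # keep the remainder
                    break
            else:
                # ... otherwise grow the arena.
                off, arena = arena, arena + size
            offsets[name] = off
            live.append((last, off, size))
        return offsets, arena

    # Example: "C" can reuse "A"'s region, so the arena is 6144 bytes rather
    # than the 10240 a naive one-buffer-per-tensor scheme would need.
    tensors = [("A", 4096, 0, 1), ("B", 2048, 1, 2), ("C", 4096, 2, 3)]
    print(plan_memory(tensors))   # ({'A': 0, 'B': 4096, 'C': 0}, 6144)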

Pre-requisites

Intermediate knowledge of machine learning and the underlying mathematics will be helpful. The project is a tool for ML inference developed in C++ with Python interfaces through the C-Python API, so a basic understanding of these languages and their libraries will be beneficial. Familiarity with operations such as GEMM (general matrix multiplication) and ReLU, and with hardware accelerators, will be useful for following the latest developments of the project.

Sanjiban is a doctoral student at CERN, affiliated with the University of Manchester. He is researching optimization strategies for efficient machine learning inference for the High-Luminosity phase of the Large Hadron Collider, within the Next-Gen Triggers Project. Previously, he was a CERN Summer Student in 2022 and contributed to CERN-HSF through the Google Summer of Code program in 2021. Within SOFIE, he worked in particular on the Keras and PyTorch parsers, storage functionality, machine learning operators based on the ONNX standard, and support for Graph Neural Networks. He has also mentored Google Summer of Code contributors in 2022, 2023, 2024, and 2025, as well as the 2023 CERN Summer Students working on CERN's ROOT Data Analysis Project.

Previously, Sanjiban spoke at PyCon India 2023 about Python interfaces for Meta's Velox engine, and presented a talk on the Velox architecture at PyCon Thailand 2023. He contributes to open-source projects in data science and engineering, including ROOT, Apache Arrow, and Substrait.