Advancements in optimizing ML Inference at CERN
2025-10-01, Louis Armand 2 - Ouest

At CERN—the European Organization for Nuclear Research—machine learning is applied across a wide range of scenarios, from simulations and event reconstruction to classifying interesting experimental events, all while handling data rates in the order of terabytes per second. As a result, beyond developing complex models, CERN also requires highly optimized mechanisms for model inference.

In the ML4EP team at CERN, we have developed SOFIE (System for Optimized Fast Inference code Emit), an open-source tool designed for fast inference of ML models with minimal dependencies and low latency. SOFIE is under active development, driven by feedback not only from high-energy physics researchers but also from the broader scientific community.

With upcoming upgrades to CERN’s experiments expected to increase data generation, we have been investigating optimization methods to make SOFIE even more efficient in terms of time and memory usage, while improving its accessibility and ease of integration with other software stacks.

In this talk, we will introduce SOFIE and present novel optimization strategies developed to accelerate ML inference and reduce resource overhead.


Experiments at CERN generate a tremendous amount of data every second, which is then analyzed in search of new discoveries and physical processes. From simulating physical phenomena to reconstructing the tracks of fundamental particles using GenAI architectures like diffusion models, and from applying anomaly detection to uncover unknown physics to identifying interesting events for further study, machine learning plays a key role in all such applications. While sophisticated models exist for these scenarios, running inference on them efficiently remains challenging. Popular inference frameworks are often not well-suited to the high data influx rates at CERN, since they lack flexibility and do not offer fine-grained control.

To address these challenges, SOFIE (System for Optimized Fast Inference code Emit) was developed: a tool that translates trained ML models into highly optimized C++ code, generating functions that can be easily integrated and called for inference. SOFIE converts models in ONNX format into its own intermediate representation, and also supports models trained in Keras, PyTorch, and message-passing GNNs from DeepMind's Graph Nets library. With the core functionality developed in C++, SOFIE comes with Python interfaces for easier access.
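As a minimal sketch, the snippet below shows what this workflow can look like through ROOT's Python bindings (PyROOT), following the pattern of ROOT's SOFIE tutorials. It assumes a ROOT build with TMVA/SOFIE enabled; the file names "model.onnx" and "model.hxx" are placeholders.

    # Sketch: ONNX model -> SOFIE intermediate representation -> generated C++.
    # Assumes ROOT built with TMVA/SOFIE; file names are placeholders.
    import ROOT

    SOFIE = ROOT.TMVA.Experimental.SOFIE

    # Parse the ONNX file into SOFIE's intermediate representation (RModel).
    model = SOFIE.RModelParser_ONNX().Parse("model.onnx")

    # Emit a standalone C++ header containing the inference function,
    # with the trained weights stored alongside it.
    model.Generate()
    model.OutputGenerated("model.hxx")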

The key advantage of SOFIE is its ability to generate standalone C++ code that can be invoked directly within applications with low latency and minimal dependencies—requiring only BLAS for numerical computations. This enables seamless integration into high-energy physics workflows and other computationally demanding environments. Additionally, the generated code can be compiled at runtime using Cling’s Just-In-Time (JIT) compiler, allowing flexible execution, including within Python environments. By eliminating the need for heavyweight ML frameworks during inference, SOFIE offers a highly efficient and easily deployable solution.
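For illustration, calling the generated code from Python could look like the sketch below, with Cling JIT-compiling the header inside the running process. The names used here are assumptions: the generated namespace is derived from the model file (assumed TMVA_SOFIE_model for "model.hxx"), and the input is assumed to be ten floats.

    # Sketch: JIT-compile the generated header with Cling and run inference.
    # "model.hxx", TMVA_SOFIE_model, and the input size are assumptions.
    import ROOT

    ROOT.gInterpreter.Declare('#include "model.hxx"')  # compiled at runtime by Cling

    session = ROOT.TMVA_SOFIE_model.Session()          # loads the stored weights
    x = ROOT.std.vector["float"](10, 1.0)              # example input: ten ones
    output = session.infer(x.data())                   # returns the output tensor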

Recent benchmarking results indicate that SOFIE outperforms ONNX Runtime for small-scale models in single-event evaluation. However, further optimizations are required to handle more complex models efficiently, especially in light of upcoming CERN upgrades expected to increase the data influx rate by 100x.

In this talk, we will explore SOFIE—CERN’s in-house tool for fast ML inference—its architecture, and the new optimization mechanisms developed to further reduce latency and minimize data movement. Although developed at CERN, SOFIE is applicable to other high-throughput environments such as autonomous vehicles, space exploration, industrial automation, fraud detection and beyond.

Outline

  • ML at CERN
    • Areas of ML applications at CERN
    • Difficulties and new implementations
  • Introducing SOFIE
    • Motivation
    • Why does CERN need super-fast ML inference with low latency and minimal dependencies?
    • Why aren't frameworks like TensorFlow or PyTorch a good fit for ML inference at CERN?
  • SOFIE Architecture
    • Parser
    • Model Storage
    • Inference Code Generator
  • SOFIE in Action: Current support
  • SOFIE Optimization Methods
    • Space Optimization
      • Memory Planning
        • Memory reuse with the "Partition first, Merge later" algorithm (a generic sketch of lifetime-based reuse follows this outline)
        • Memory Allocators
        • Allocation/Deallocation through SoA (Struct of Arrays)
        • Custom user memory handlers
        • Extending from CPU to GPU memory
      • Operator Fusion and Elimination
      • Dynamic Computation Support
      • Kernel-level optimization
    • Time Optimization
      • Data caching avoiding copies
      • Matrix Calculations
      • Sparse data handling
      • Activation optimization
      • Kernel-level optimization
  • Benchmarking results
  • Future Goals
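The "Partition first, Merge later" item above concerns reusing the memory of intermediate tensors once they are no longer needed. As a generic illustration of the underlying idea, and not SOFIE's actual implementation, the sketch below assigns offsets in a single arena from tensor lifetimes, reusing freed regions with a first-fit policy:

    # Generic sketch of lifetime-based buffer reuse (illustrative only;
    # not SOFIE's actual "Partition first, Merge later" algorithm).
    def plan_memory(tensors):
        """tensors: list of (name, size, first_use, last_use) in execution order.
        Returns ({name: offset}, arena_size) for one shared buffer."""
        live = []                 # (last_use, offset, size) of tensors still alive
        free = []                 # (offset, size) regions whose tensor has died
        offsets, arena = {}, 0
        for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
            # Retire tensors whose last use precedes this tensor's creation.
            for entry in [e for e in live if e[0] < first]:
                live.remove(entry)
                free.append((entry[1], entry[2]))
            # First fit: reuse a freed region if it is large enough ...
            for i, (off, sz) in enumerate(free):
                if sz >= size:
                    free[i] = (off + size, sz - size)  # keep the remainder
                    break
            else:
                # ... otherwise grow the arena.
                off, arena = arena, arena + size
            offsets[name] = off
            live.append((last, off, size))
        return offsets, arena

    # Example: "C" can reuse "A"'s region, so the arena is 6144 bytes rather
    # than the 10240 a naive one-buffer-per-tensor scheme would need.
    tensors = [("A", 4096, 0, 1), ("B", 2048, 1, 2), ("C", 4096, 2, 3)]
    print(plan_memory(tensors))   # ({'A': 0, 'B': 4096, 'C': 0}, 6144)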

Pre-requisites

Intermediate knowledge of machine learning and the underlying mathematics will be helpful. The project is a tool for ML inference developed in C++ with Python interfaces through the C-Python API, so a basic understanding of these languages and their libraries will be beneficial. Familiarity with operations such as GEMM (general matrix multiplication) and ReLU, and with hardware accelerators, will be useful for following the latest developments of the project.

Sanjiban is a doctoral student at CERN, affiliated with the University of Manchester. He is researching optimization strategies for efficient machine learning inference for the High-Luminosity phase of the Large Hadron Collider, within the Next-Gen Triggers Project. Previously, he was a CERN Summer Student in 2022 and contributed to CERN-HSF through the Google Summer of Code program in 2021. Within SOFIE, he worked in particular on the Keras and PyTorch parsers, storage functionality, machine learning operators based on the ONNX standard, and support for Graph Neural Networks. He has also mentored Google Summer of Code contributors in 2022, 2023, 2024, and 2025, as well as the 2023 CERN Summer Students working on CERN's ROOT Data Analysis Project.

Previously, Sanjiban spoke at PyCon India 2023 about Python interfaces for Meta's Velox engine, and presented a talk on the Velox architecture at PyCon Thailand 2023. He contributes to open-source projects in data science and engineering, including ROOT, Apache Arrow, and Substrait.