2025-07-25, Main Room 5
AcceleratedKernels.jl is a unified, backend-agnostic library for high-performance parallel algorithms in Julia. Built on KernelAbstractions.jl, it lets you write code once and run it everywhere, supporting multithreaded CPUs and GPUs (CUDA, ROCm, oneAPI, Metal) from a single codebase. In this talk I explain the design, implementation, and 200-GPU benchmark results of AcceleratedKernels.jl and show how to write portable code that runs on different hardware without modification.
In this talk I present AcceleratedKernels.jl, a library that provides a unified interface for writing parallel algorithms in Julia. The library is built on KernelAbstractions.jl, which allows high-level Julia code to be compiled into efficient kernels for a range of hardware. AcceleratedKernels.jl supports both multithreaded CPUs and GPUs from several vendors (CUDA, ROCm, oneAPI, Metal) using a single codebase. This design removes the need to write separate code for each target, making it easier for developers to write and maintain high-performance applications.
Key points in the talk include:
- Unified Codebase: I describe how the same Julia user code can produce high-performance kernels for different hardware targets.
- Performance Benchmarks: I will present benchmark results comparing AcceleratedKernels.jl with traditional implementations. Benchmarks for operations such as sorting, mapreduce, and arithmetic computations show that kernels generated by AcceleratedKernels.jl perform comparably to C with OpenMP (on CPUs) and vendor libraries such as Nvidia Thrust (on GPUs). The tests were run on a range of architectures, from desktop CPUs to data-center GPUs, and the results demonstrate competitive speed and scalability.
- Developer Experience: I will show how to write custom kernels in Julia with minimal changes to existing code, with the aim of writing a user application or library that works transparently across architectures, without special-cased GPU kernels or explicit multithreading. This also enables composable CPU-GPU co-processing across Julia libraries.
- Real-World Applications: I will discuss several use cases from scientific computing and industry where the ability to run the same code on different hardware is valuable. Examples include multi-node data sorting and numerical simulations - in particular Lagrangian simulations such as the Discrete Element Method, Molecular Dynamics, or N-Body Simulations - where parallel execution is critical.
- Future Work: I will outline planned improvements for AcceleratedKernels.jl, such as adding automated tuning for algorithm parameters, extending the range of available algorithms, and supporting emerging hardware platforms. I also discuss how contributions from the community can help shape the future of the library.
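To make the unified interface above concrete, here is a minimal sketch of calling the library's drop-in parallel primitives on a plain CPU array; based on the package's public API as documented (keyword arguments such as `init` may vary between releases), the same calls should run unchanged on a GPU array type such as CUDA.jl's `CuArray`:

```julia
import AcceleratedKernels as AK

# A plain CPU array; swapping this for a GPU array type
# (e.g. CuArray from CUDA.jl) runs the same calls on the GPU.
v = rand(Float32, 1_000)

AK.sort!(v)                          # backend-agnostic parallel sort

# Backend-agnostic reductions; `init` gives the neutral element.
total = AK.reduce(+, v; init = zero(eltype(v)))
sumsq = AK.mapreduce(x -> x * x, +, v; init = zero(eltype(v)))
```

No per-backend code appears anywhere: the array type alone selects the execution backend.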
AcceleratedKernels.jl was created to simplify parallel programming by reducing the need for hardware-specific code. Instead of writing separate kernels for each target, developers write a single function that runs across all supported devices.
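As a sketch of what such a single function can look like, the following hypothetical element-wise kernel uses the library's general-purpose `foreachindex` looping construct (names follow the public API as I understand it; the function `saxpy!` is my own illustration, not part of the package):

```julia
import AcceleratedKernels as AK

# Write the kernel body once; AK.foreachindex parallelises it over
# the indices of `x` on whatever backend `x` lives on - CPU threads
# for a Vector, a device kernel launch for a GPU array.
function saxpy!(y, a, x)
    AK.foreachindex(x) do i
        y[i] = a * x[i] + y[i]
    end
    return y
end

x = ones(Float32, 8)
y = zeros(Float32, 8)
saxpy!(y, 2.0f0, x)   # the same call works for CuArray / ROCArray inputs
```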
The talk will also include a live demonstration. I will write a simple kernel in Julia and show how it runs on both a CPU and a GPU without any modification. I will also discuss some challenges encountered during development, including algorithm and interface design decisions.
Finally, I will place AcceleratedKernels.jl within the broader Julia ecosystem and show its composability across separate libraries.
In summary, this session provides a detailed overview of AcceleratedKernels.jl, covering its design, performance, and practical applications. Attendees will learn how to write portable parallel code in Julia using a single, unified API and understand the trade-offs involved in cross-architecture programming. This talk is aimed at developers, researchers, and anyone interested in high-performance computing with Julia, and it offers practical insights into writing code that runs efficiently on modern hardware.
Andrei-Leonard Nicușan is a final-year doctoral researcher in the University of Birmingham’s School of Chemical Engineering and CTO of EvoPhase Ltd., an industrial-AI spinout. He has published featured articles and Scientific Highlights on machine-learning-based algorithms, metaprogramming-driven evolutionary optimisation, simulational-experimental calibration, and positron imaging, work for which he won the 2024 IChemE Young Engineers Award for Innovation and Sustainability. His open-source frameworks are actively used in academia and industry, with work in partnership with GlaxoSmithKline winning the 2023 “Best Use of HPC in Industry” award from HPCWire.