JuliaCon 2026

KernelForge.jl: Fast, Flexible GPU Computing Toward Portability
2026-08-13, Room 3

GPU vendor libraries like cuBLAS deliver excellent performance but come with hard constraints: limited type support, fixed operators, and single-vendor hardware. The Julia GPU ecosystem addresses portability through an abstraction layer: KernelAbstractions.jl (KA.jl) lets developers write kernels that compile across CUDA, AMD, Intel, and Apple backends. But abstraction currently comes at a cost: KA.jl lacks the intrinsics needed for fully optimized performance. Warp operations on extended types, vectorized memory access, and explicit memory ordering for inter-workgroup communication are all missing. We introduce KernelForge.jl, a Julia package demonstrating that portable GPU code can match vendor-optimized performance. To make this possible, we developed KernelIntrinsics.jl, which exposes the missing primitives (currently CUDA-only, though the approach extends to other backends). KernelForge.jl provides kernels for matrix-vector and vector-matrix products with arbitrary operators and bitstype elements, mapreduce over 1D and 2D arrays, prefix scan, and copy operations. Each is implemented as a single kernel using vectorized loads/stores to saturate memory bandwidth as much as possible, warp-level reductions, and strong memory ordering for correct inter-workgroup synchronization. Benchmarks show that KernelForge.jl matches or exceeds both proprietary CUDA functions and NVIDIA's CUB library. The kernels are stable and tested, though views and strided arrays are not yet supported. The goal is straightforward: open-source GPU code that is efficient, flexible, and eventually portable.


This talk presents KernelForge.jl, a Julia package for high-performance GPU computing. We first survey the current Julia GPU landscape—backend packages and abstraction layers—and identify what's missing for peak performance. We then introduce the intrinsics we developed in KernelIntrinsics.jl to fill these gaps. Finally, we show how KernelForge.jl uses these primitives to match or exceed vendor-optimized libraries like cuBLAS and CUB.

I The Current Julia GPU Landscape

The Julia GPU ecosystem is organized into two complementary layers: backend packages (CUDA.jl, AMDGPU.jl, oneAPI.jl) and abstraction packages (GPUArrays.jl, KernelAbstractions.jl, AcceleratedKernels.jl) that enable cross-architecture portability.

On the backend side, CUDA.jl relies on libcuda for memory copies and on cuBLAS for vector dot products and matrix multiplications; these routines are highly efficient. However, cuBLAS is proprietary, supports only a restricted set of types and operators, and is available only on NVIDIA GPUs.

This raises a central question: is it possible to write open-source functions that are efficient, flexible, and portable?

Abstraction Side

KernelAbstractions.jl (KA.jl) provides tools to write kernels that can be compiled for multiple backends. This works through method overriding: each backend implements its own version of core methods (e.g., _synchronize) using the @device_override macro. At compile time, KA.jl specializes the kernel for the backend context and argument types, following standard Julia dispatch. The code is then lowered to LLVM intermediate representation before being compiled to low-level assembly (PTX for CUDA).
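To make this concrete, here is a minimal sketch of the KA.jl programming model (the vadd! kernel and launch parameters are illustrative, not taken from KernelForge.jl). The same kernel source runs on any backend; swapping CPU() for CUDABackend() from CUDA.jl is the only change needed for the GPU:

    using KernelAbstractions

    # Backend-agnostic element-wise addition: the same source compiles to
    # PTX under CUDA.jl or to native code for AMD, Intel, and Apple backends.
    @kernel function vadd!(c, a, b)
        i = @index(Global)
        @inbounds c[i] = a[i] + b[i]
    end

    backend = CPU()                    # CPU backend shipped with KA.jl
    a, b = rand(Float32, 1024), rand(Float32, 1024)
    c = similar(a)
    vadd!(backend, 256)(c, a, b; ndrange = length(c))  # 256-item workgroups
    KernelAbstractions.synchronize(backend)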

GPUArrays.jl uses KA.jl, notably for vector copy operations, which KA.jl makes straightforward to implement. AcceleratedKernels.jl provides reduction, scan, and sort functions built on KA.jl, achieving reasonable performance.

Our goal with KernelForge.jl is to outperform current cross-architecture libraries (AcceleratedKernels.jl, but also Kokkos and RAJA in C++), as well as CUDA.jl and the native proprietary CUDA libraries, and to demonstrate that it is possible to develop open-source code that is efficient, flexible, and portable at the same time.

II Intrinsics Currently Missing from the Julia Ecosystem

In practice, NVIDIA, AMD, and Intel GPUs share a similar architecture. Cores (called streaming multiprocessors, or SMs, in CUDA) schedule groups of threads. Each group is composed of warps, sets of 32 perfectly synchronized threads. Warp size varies with the architecture, but the principle is the same. Warps communicate through shared memory; blocks communicate through global memory. The relationship between threads and memory can be seen as a producer/consumer model: threads issue read or write requests to global memory, which returns data according to its bandwidth. For compute-bound workloads like matrix multiplication, the bottleneck lies in optimizing computation within threads, since the memory system has time to keep up. But for memory-bound workloads like copy or prefix scan, the bottleneck shifts to memory access optimization.
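To make the warp model concrete, the following is a minimal sketch of an intra-warp sum built on CUDA.jl's public shfl_down_sync intrinsic (the warp_sum helper name is ours). Called from inside a kernel, it reduces 32 values register-to-register, without shared memory or explicit synchronization:

    using CUDA

    # Device-side helper: tree reduction within one warp. After the loop,
    # the first lane holds the sum of the 32 participating values.
    @inline function warp_sum(val::Float32)
        offset = 16
        while offset > 0
            # Each lane adds the value held by the lane `offset` positions away.
            val += CUDA.shfl_down_sync(0xffffffff, val, offset)
            offset >>= 1
        end
        return val
    end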
Unfortunately, KA.jl does not currently expose the full set of intrinsics needed to achieve peak GPU performance, particularly on the memory access side. For KernelForge.jl, we have been developing these intrinsics in KernelIntrinsics.jl (currently available only for the CUDA backend, though extension to other backends is feasible):

  • Warp operations with extended types: GPUs execute threads in warps of 32 (CUDA) or similar groups (AMD, Intel). CUDA.jl does not support warp operations on extended types such as quaternions or NTuples (see the sketch after this list).

  • Vectorized loads/stores: Loading multiple Float32 values simultaneously enables faster memory bandwidth saturation, yielding substantial performance gains, especially when data fits within L2 cache.

  • Fences and memory ordering: Explicit control over memory ordering is essential for correct and efficient inter-group communication on GPUs. In particular, it avoids relaunching kernels and globally synchronizing between blocks. The performance gain is particularly noticeable for kernels such as prefix scan, for which we use a decoupled lookback algorithm.
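The sketch below illustrates the first and third items using CUDA.jl's public intrinsics (the helper names are ours; KernelIntrinsics.jl's actual API may differ): a warp shuffle extended element-wise to NTuple values, and a fence-ordered publish of a per-block partial result of the kind decoupled lookback relies on:

    using CUDA

    # Warp shuffle on an NTuple, component by component. ntuple(f, Val(N))
    # unrolls at compile time, so this is valid device code.
    @inline shfl_down(val::NTuple{N,Float32}, offset) where {N} =
        ntuple(i -> CUDA.shfl_down_sync(0xffffffff, val[i], offset), Val(N))

    # Publish a block's partial result for later blocks to consume. The fence
    # orders the data write before the flag write; a production decoupled
    # lookback would also use an atomic or volatile store for the flag.
    @inline function publish!(partials, flags, block_sum, bid)
        @inbounds partials[bid] = block_sum
        CUDA.threadfence()          # make the partial visible grid-wide first
        @inbounds flags[bid] = Int32(1)
        return nothing
    end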

Our intrinsics design draws inspiration from UnsafeAtomics.jl (for its structural approach) and CUDA.jl.

III KernelForge.jl and Our Objectives

KernelForge.jl demonstrates that abstracted kernels built with KA.jl can achieve backend-level efficiency, at least on CUDA. We provide kernels for matrix-vector and vector-matrix products (supporting general operators and bitstype elements), mapreduce over 1D and 2D arrays, prefix scan, and copy operations that match libcuda-level bandwidth through vectorized loads/stores. Benchmarks available in the KernelForge.jl repository show performance matching or exceeding proprietary CUDA functions and CUB. The package includes correctness tests and the kernels are stable, though views and strided arrays are not yet supported.

KernelForge.jl builds on KA.jl and KernelIntrinsics.jl (CUDA-only for now). Each operation is implemented with a single kernel, using vectorized loads and stores to saturate memory bandwidth, warp-level reductions for faster intra-warp computations, and strong memory ordering to enable correct inter-workgroup communication.
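As an illustration of the vectorized-access ingredient, here is a sketch in the spirit of these kernels (not KernelForge.jl's actual source; copy4! and fast_copy! are illustrative names). Reinterpreting Float32 buffers as 4-wide tuples lets each work-item move 16 bytes per load and store:

    using KernelAbstractions
    using CUDA   # provides CuArray and the CUDA backend used below

    # Each work-item copies one NTuple{4,Float32}: a single 16-byte aggregate
    # load and store instead of four separate 4-byte transactions.
    @kernel function copy4!(dst4, src4)
        i = @index(Global, Linear)
        @inbounds dst4[i] = src4[i]
    end

    # Host wrapper; assumes the length is a multiple of 4. CUDA allocations
    # are 16-byte aligned, so the reinterpret is safe on fresh arrays.
    function fast_copy!(dst::CuArray{Float32}, src::CuArray{Float32})
        @assert length(dst) == length(src) && length(dst) % 4 == 0
        dst4 = reinterpret(NTuple{4,Float32}, dst)
        src4 = reinterpret(NTuple{4,Float32}, src)
        copy4!(get_backend(dst), 256)(dst4, src4; ndrange = length(dst4))
        return dst   # launch is asynchronous; synchronize before timing
    end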

I am an assistant professor in statistics at ENSAI (Rennes), working on machine learning and high-dimensional statistics. My current fields of research include crowdsourcing, change-point detection, bandit theory, dimension reduction, and clustering. My PhD focused on change-point detection and ranking problems.
I am also interested in high-performance computing with Julia. I develop KernelForge.jl, a package providing portable GPU primitives such as matrix-vector operations, prefix sum, mapreduce, and copy that aim to match vendor-optimized performance, and KernelIntrinsics.jl, which provides the low-level intrinsics currently missing from the ecosystem.