Emmanuel Pilliat
I am an assistant professor in statistics at ENSAI (Rennes), working on machine learning and high-dimensional statistics. My current fields of research include crowdsourcing, change-point detection, bandit theory, dimension reduction, and clustering. My PhD focused on change-point detection and ranking problems.
I am also interested in high-performance computing with Julia. I develop Luma.jl, a package of portable GPU primitives (matrix-vector operations, prefix sum, mapreduce, and copy) that aims to match vendor-optimized performance, and KernelIntrinsics.jl, which provides low-level intrinsics currently missing from the ecosystem.
Session
GPU vendor libraries like cuBLAS deliver excellent performance but come with hard constraints: limited type support, fixed operators, and single-vendor hardware. The Julia GPU ecosystem addresses portability through an abstraction layer: KernelAbstractions.jl lets developers write kernels that compile across CUDA, AMD, Intel, and Apple backends. But abstraction currently comes at a cost: KA.jl lacks the intrinsics needed for fully optimized performance. Warp operations on extended types, vectorized memory access, and explicit memory ordering for inter-workgroup communication are missing.

We introduce KernelForge.jl, a Julia package proving that portable GPU code can match vendor-optimized performance. To make this possible, we developed KernelIntrinsics.jl, which exposes the missing primitives (currently CUDA-only, though the approach extends to other backends).

KernelForge.jl provides kernels for matrix-vector and vector-matrix products with arbitrary operators and bitstype elements, mapreduce over 1D and 2D arrays, prefix scan, and copy operations. Each is implemented as a single kernel that uses vectorized loads and stores to saturate memory bandwidth, warp-level reductions, and strong memory ordering for correct inter-workgroup synchronization.

Benchmarks show that KernelForge.jl matches or exceeds both proprietary CUDA functions and NVIDIA's CUB library. The kernels are stable and tested, though views and strided arrays are not yet supported. The goal is straightforward: open-source GPU code that is efficient, flexible, and eventually portable.
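To illustrate the abstraction layer the talk builds on, here is a minimal KernelAbstractions.jl kernel. This is a generic sketch of the portable-kernel style, not code from KernelForge.jl; the SAXPY operation and the choice of CPU backend are assumptions for demonstration (on a GPU one would pass e.g. `CUDABackend()` instead).

```julia
using KernelAbstractions

# A generic fused multiply-add kernel: y[i] = a * x[i] + y[i].
# The same definition compiles for CPU, CUDA, AMD, Intel, and Apple backends.
@kernel function saxpy!(y, a, @Const(x))
    i = @index(Global)
    @inbounds y[i] = a * x[i] + y[i]
end

backend = CPU()  # assumption for this sketch; swap for a GPU backend
x = ones(Float32, 1024)
y = zeros(Float32, 1024)

# Instantiate with a workgroup size of 256, then launch over the full range.
saxpy!(backend, 256)(y, 2f0, x; ndrange = length(x))
KernelAbstractions.synchronize(backend)
```

After synchronization every entry of `y` equals `2f0`. The abstract's point is that this portability currently stops short of the hardware: nothing in the `@kernel` vocabulary exposes warp shuffles, vectorized loads, or memory-ordering fences, which is the gap KernelIntrinsics.jl fills.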