2025-07-24, Main Room 3
GPU-accelerated sparse operations are notoriously difficult to implement efficiently. In addition, the GraphBLAS standard requires modularity: users can supply custom operators to use in the matrix multiplication, which usually does not mix well with handwritten GPU kernels. Leveraging KernelAbstractions.jl, we take a novel approach, JIT-compiling specialized kernels for any user-defined operator while remaining portable across GPU platforms.
GraphBLAS is an API specification that aims to "define standard building blocks for graph algorithms in the language of linear algebra" (https://graphblas.org). A full implementation of the GraphBLAS API lets users write a wide range of graph algorithms (from breadth-first search to graph coloring) using the correspondence between graph manipulation and linear algebra operations on the graph's adjacency matrix.
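As an illustration of this correspondence, here is a minimal, library-agnostic sketch in plain Julia (not GraphBLAS code, and not our implementation): each BFS frontier expansion is simply a matrix-vector product with the adjacency matrix, masked by the not-yet-visited vertices.

```julia
using SparseArrays

# One BFS step is A * frontier masked by the unvisited vertices; iterating this masked
# product until the frontier empties yields the BFS level of every vertex.
function bfs_levels(A::SparseMatrixCSC, src::Integer)
    n = size(A, 1)
    level = fill(-1, n)                        # -1 means "not reached yet"
    frontier = falses(n)
    frontier[src] = true
    level[src] = 0
    d = 0
    while any(frontier)
        d += 1
        reached = (A * frontier) .> 0          # vertices adjacent to the current frontier
        frontier = reached .& (level .== -1)   # keep only unvisited vertices (the "mask")
        level[frontier] .= d
    end
    return level
end

# Path graph 1 -> 2 -> 3 -> 4, stored as A[i, j] != 0 for an edge j -> i.
A = sparse([2, 3, 4], [1, 2, 3], ones(Int, 3), 4, 4)
bfs_levels(A, 1)                               # returns [0, 1, 2, 3]
```

GraphBLAS generalizes this pattern by letting the (+, *) pair be replaced by other semirings, which is exactly where user-defined operators come in.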
In this lightning talk, we will go over the main challenges encountered when building a sparse linear algebra framework on GPUs, in particular when implementing the GraphBLAS API.
High memory-to-computation ratio
Sparse linear algebra functions are typically "memory-bound": there is a lot of data to read from memory and very little computation to do with each piece of data. These kinds of functions can be hard to adapt efficiently to GPUs, as they are bottlenecked by the GPU's memory bandwidth and cannot fully exploit its large computing capacity.
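For intuition, here is a rough roofline-style estimate for sparse matrix-vector multiply; the figures are illustrative assumptions, not measurements of our implementation.

```julia
# Rough arithmetic-intensity estimate per stored nonzero of a sparse matrix-vector
# product (Float64 values, Int32 column indices); figures are illustrative, not measured.
flops_per_nnz = 2                           # one multiply + one add
bytes_per_nnz = 8 + 4 + 8                   # matrix value + column index + (amortized) x[j] read
intensity = flops_per_nnz / bytes_per_nnz   # ~0.1 flop per byte

# On a data-center GPU sustaining roughly 2-3 TB/s of memory bandwidth, this caps SpMV
# near intensity * bandwidth, far below the chip's dense floating-point peak.
bandwidth_GBs = 3000.0
roofline_GFLOPs = intensity * bandwidth_GBs   # ~300 GFLOP/s
```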
Load balancing
Basic methods to parallelize sparse linear algebra operations divide the workload by splitting the matrix by rows or by columns. Consequently, if the distribution of non-zero elements is not uniform across rows or columns, some threads receive far more work than others, which harms performance. This is a real problem on GPUs, as many real-world applications have highly heterogeneous non-zero distributions.
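The sketch below illustrates the issue on a synthetic matrix with one "hub" row (the sizes and numbers are made up for illustration): with one-thread-per-row scheduling, the work assigned to each thread is exactly the nonzero count of its row.

```julia
using SparseArrays, Random, Statistics

# Synthetic example: ~5 nonzeros per row on average plus one dense "hub" row,
# mimicking the skewed degree distributions of many real-world graphs.
Random.seed!(0)
n = 10_000
rows = [rand(1:n, 5n); fill(1, n)]        # background entries + a fully dense row 1
cols = [rand(1:n, 5n); collect(1:n)]
A = sparse(rows, cols, ones(length(rows)), n, n)

# With one-thread-per-row scheduling, each thread's work is the nonzero count of its row.
row_nnz = zeros(Int, n)
foreach(r -> row_nnz[r] += 1, rowvals(A))
println("mean nnz/row = ", mean(row_nnz), ", max nnz/row = ", maximum(row_nnz))
# The thread owning row 1 does orders of magnitude more work than the average thread,
# stalling its whole warp/block while the rest of the GPU sits idle.
```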
Modularity
The GraphBLAS API can be hard to bring to GPUs because it requires letting users supply arbitrary custom operators for the matrix-vector multiplication. Hand-writing kernels for every combination of supported operators could work, but the number of combinations quickly spins out of control, and such a large codebase is hard to develop and maintain, as well as more error-prone.
Up to now, we have focused on the modularity problem. To tackle it, we leverage some of the work previously done by the JuliaGPU community, in particular KernelAbstractions.jl. This allows us to write modular kernels parametrized on the user-supplied operators. These "high-level" kernels are then recompiled just-in-time when called with new operators, generating efficient, specialized low-level kernels.
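The snippet below is a simplified sketch of this pattern, not our package's actual API (the names spmv_row_kernel! and semiring_spmv! are hypothetical): a CSR-style matrix-vector product written once with KernelAbstractions.jl and parametrized on a user-supplied semiring. Calling it with a new (add, mul) pair triggers JIT compilation of a specialized native kernel on whichever backend owns the arrays (CPU, CUDA, ROCm, Metal, oneAPI, ...).

```julia
using KernelAbstractions

# One work-item per row; the semiring's "add" (monoid), "mul", and identity value
# are ordinary arguments, so the kernel is specialized on them at compile time.
@kernel function spmv_row_kernel!(y, rowptr, colval, nzval, x, add, mul, init)
    row = @index(Global)
    acc = init
    @inbounds for p in rowptr[row]:(rowptr[row + 1] - 1)
        acc = add(acc, mul(nzval[p], x[colval[p]]))
    end
    @inbounds y[row] = acc
end

# Wrapper that picks the backend from the output array and launches one work-item per row.
function semiring_spmv!(y, rowptr, colval, nzval, x;
                        add = +, mul = *, init = zero(eltype(y)))
    backend = get_backend(y)
    kernel = spmv_row_kernel!(backend)
    kernel(y, rowptr, colval, nzval, x, add, mul, init; ndrange = length(y))
    KernelAbstractions.synchronize(backend)
    return y
end

# Example: the (min, +) "tropical" semiring used in shortest-path relaxations:
# semiring_spmv!(y, rowptr, colval, nzval, x; add = min, mul = +, init = typemax(eltype(y)))
```

Because the operators are part of the kernel's specialization, no combinatorial library of hand-written kernels is needed, and the same source runs on every backend KernelAbstractions.jl supports.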
I'm a student at the École Polytechnique Fédérale de Lausanne (EPFL). I'm currently pursuing a Master's degree in Data Science with a minor in Computational Science and Engineering. I use Julia to support research and prototype new techniques in high-performance computing and graph analysis.