2024-07-12 –, Method (1.5)
GraphBLAS.jl is a sparse linear algebra package which provides both conformant sparse AbstractArrays and a novel array type for writing graph kernels. Since the last talk on this package a lot has changed: A cross language JIT between Julia and C, complete AbstractArray support, compiler-based backends including GPU support, and Julia-level kernel fusion are all recent additions to the package.
Initially named SuiteSparseGraphBLAS.jl a few years ago, the current edition of GraphBLAS.jl provides among the fastest sparse linear algebra libraries in any language today. At it’s core is the SuiteSparse:GraphBLAS C library which provides hand-tuned kernels for CPU execution of sparse and graph algorithms. However, GraphBLAS.jl now uses these kernels as one of several directed acyclic graph (DAG) execution strategies which includes library functions from CUDA / AMDGPU, and runtime compilation from multiple compilers like Finch.jl.
These new execution backends are supported by three separate array types, EagerGBMatrix
, LazyGBMatrix
and GBMatrix
. EagerGBMatrix
executes function calls in a blocking fashion, and performs minimal laziness or fusion (with the exception of lazy setindex!
). The LazyGBMatrix
performs no computation until a function is called to explicitly materialize the array, and any LazyGBMatrix
which is not explicitly materialized may never be computed. However this does not conform to the AbstractArray
interface, so the final type GBMatrix
performs this laziness only when it is safe to do so.
When laziness is enabled, or the lazy DAG is explicitly constructed, the GraphBLAS.jl runtime will first attempt to dispatch to hand-tuned kernels such as those found in CUDA/AMDGPU or SuiteSparse:GraphBLAS. SuiteSparse:GraphBLAS in particular provides a large number of function skeletons which are highly fusible. For instance, a matrix multiplication preceded and followed by multiple elementwise maps can be fused into a single function call. Using a novel Julia + C -> LLVM cross language JIT built for GraphBLAS.jl the runtime generates code using LTO which is as fast as C native implementations of the same functions. This functionality is used to implement the lazy complex adjoint, by fusing the conjugation into the matrix multiplication, as well as other common AbstractArray functionality. A similar library of function skeletons is implemented in Julia for various important kernels like Sampled-Dense-Dense-Matrix-Multiplication.
Other fused expressions or arrays are not able to be dispatched to these hand-tuned libraries. This may occur for more complex expressions, or for array types which are unsupported by library backends. A particularly common case is when the fill value for a sparse array is not equivalent to the identity and/or annihnilator of the function. For these cases a compiler is used to generate a new kernel, which is cached to avoid expensive recompilation. Several compiler backends are supported including Finch.jl, and a modified version of TACO. In the future this may be expanded to more compilers including MLIR Sparsifier, the Sparse Polyhedral Framework, and Etch.
In this talk we will briefly discuss the structure of DAG execution, before diving into performance numbers for a selection of common use cases and algorithms.
Currently I am an SM at the MIT JuliaLab, where I focus on sparse libraries and compilers.