JuliaCon 2026

Hierarchical Precision and Recursion for Accelerating Symmetric Linear Solves on MXUs
2026-08-13, Room 6

We implemented a mixed-precision, nested recursive Cholesky algorithm for GPU Matrix Processing Units (NVIDIA H200, AMD MI300X) in Julia. A hierarchical precision scheme maximizes throughput while maintaining numerical stability. Our recursive FP64 SYRK achieves a 14x speedup over cuBLAS, and the mixed-precision solver delivers a 5.32x overall speedup for Cholesky over cuSOLVER FP64. The solver leverages Julia's multiple dispatch to provide a portable interface for HPC.


Symmetric linear solves are fundamental to a wide range of scientific and engineering applications, from climate modeling and structural analysis to machine learning and optimization. These workloads often rely on Cholesky factorization (POTRF) and its supporting operations, triangular solves (TRSM) and symmetric rank-k updates (SYRK), which together form the computational core for solving symmetric positive definite systems. To accelerate these kernels, we present a portable, mixed-precision solver designed for Matrix Processing Units (MXUs), including NVIDIA Tensor Cores (H200) and AMD Matrix Cores (MI300X). Our algorithm builds on a nested recursive formulation in which Cholesky exposes parallelism through recursive decomposition of its TRSM and SYRK subproblems, incorporating the first recursive GPU implementation of SYRK. This structure yields a hierarchical recursion that maximizes GEMM throughput while enabling fine-grained control over numerical precision. We introduce a custom recursive data structure that assigns low-precision FP16 arithmetic to large off-diagonal blocks while preserving high precision on diagonal blocks to ensure numerical stability. To mitigate the limited dynamic range of FP16, we integrate a lightweight block-wise quantization scheme that prevents numerical overflow.
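The nested recursion can be sketched in a few lines of Julia. This is a minimal CPU illustration under stated assumptions, not the talk's GPU implementation: `rchol!` and the blocking threshold `nb` are hypothetical names, and the trailing update is written with `mul!`, where the actual solver would recurse into its own TRSM/SYRK kernels and dispatch mixed-precision MXU GEMMs.

```julia
using LinearAlgebra

# Recursive right-looking Cholesky sketch (illustrative; names hypothetical).
# Split A into 2x2 blocks, factor the top-left block, solve the panel with
# TRSM, apply a SYRK-style trailing update, then recurse on the trailing block.
function rchol!(A::AbstractMatrix{T}, nb::Int=64) where {T<:Real}
    n = size(A, 1)
    if n <= nb
        cholesky!(Symmetric(A, :L))        # base case: LAPACK/vendor kernel
        return A
    end
    k = n ÷ 2
    A11 = view(A, 1:k,   1:k)
    A21 = view(A, k+1:n, 1:k)
    A22 = view(A, k+1:n, k+1:n)
    rchol!(A11, nb)                         # factor top-left block in place
    rdiv!(A21, LowerTriangular(A11)')       # TRSM: A21 ← A21 * L11⁻ᵀ
    # SYRK update: A22 ← A22 − A21·A21ᵀ. On an MXU, this large off-diagonal
    # product is where the hierarchical scheme drops to FP16.
    mul!(A22, A21, A21', -one(T), one(T))
    rchol!(A22, nb)                         # recurse on trailing block
    return A
end
```

The base-case threshold plays the role of the diagonal blocks kept in high precision; everything above it is GEMM-shaped work that the recursion exposes to the matrix units.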
The solver is implemented in Julia, leveraging array programming, multiple dispatch, and dynamic type inference to enable seamless expression of mixed-precision computation. This design provides a high-level, hardware-agnostic interface while binding efficiently to low-level vendor libraries for backend portability. On H200, our recursive FP64 SYRK achieves a 14x speedup over cuBLAS, while mixed precision delivers up to a 27.0x speedup in SYRK and 5.3x in TRSM over full-precision baselines. This yields a 5.32x overall speedup for Cholesky versus cuSOLVER FP64, with 100x better accuracy than pure FP16 while retaining 88% of its peak speedup. Comparable performance and accuracy trends hold on MI300X, demonstrating broad applicability across GPU vendors.
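The block-wise quantization idea can be illustrated as follows. This is a hedged CPU sketch, not the talk's kernel: `scaled_fp16` and `mixed_syrk_update!` are hypothetical names, and the FP16 product stands in for a tensor-core GEMM with higher-precision accumulation. Scaling each block into FP16's dynamic range before the product, and undoing the scale on the high-precision accumulator, is what prevents the overflow the abstract mentions.

```julia
using LinearAlgebra

# Block-wise quantization sketch (illustrative): scale a block into FP16's
# dynamic range, remembering the scale so it can be undone after the product.
function scaled_fp16(A::AbstractMatrix{Float64})
    s = maximum(abs, A)
    s == 0 && return Float16.(A), 1.0
    return Float16.(A ./ s), s
end

# Mixed-precision SYRK-style update: C ← C − A·Aᵀ with the product taken in
# FP16 and rescaled/accumulated in FP64.
function mixed_syrk_update!(C::AbstractMatrix{Float64}, A::AbstractMatrix{Float64})
    Aq, s = scaled_fp16(A)
    P = Float64.(Aq * Aq')      # on an MXU this would be the tensor-core GEMM
    C .-= (s^2) .* P            # undo the block scale on the FP64 accumulator
    return C
end
```

For blocks with large entries, the naive FP16 product overflows to `Inf` (its maximum finite value is 65504), while the scaled version stays finite and accurate to roughly FP16 precision.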

Vicki Carrica is a Computer Science and Engineering undergraduate at the Massachusetts Institute of Technology (MIT), graduating in 2027. As a researcher in the MIT Julia Lab, she contributes to the development of high-performance linear algebra routines, focusing on GPU acceleration and algorithmic efficiency.