JuliaCon 2025

Toward Modern Linear Algebra: Single API Kernels for HPC
2025-07-23, Main Room 4

Modern hardware like NVIDIA’s H100, GB100, and AMD’s MI300 accelerators demand flexible, high-performance software. DLA.jl modernizes dense linear algebra with a unified, hardware-agnostic API, while Dagger.jl enables dynamic task scheduling across CPUs and GPUs. Together, they provide scalable, efficient computation without vendor lock-in. This talk explores their impact on HPC, AI, and scientific computing, highlighting future directions in mixed precision and adaptive scheduling.


Background. Recent advancements in HPC hardware, such as NVIDIA’s H100 and GB100 GPUs and AMD’s MI300 accelerators, have significantly expanded computational power for scientific computing, AI, and large-scale simulations. These architectures provide high memory bandwidth, mixed-precision capabilities, and extensive parallelism, making them ideal for dense linear algebra. However, fully leveraging these capabilities remains challenging due to reliance on legacy numerical libraries.
Traditional vendor-optimized solutions like cuBLAS, rocBLAS, and oneMKL require separate implementations for different hardware, leading to fragmentation, reduced portability, and increased complexity. To address these challenges, we introduce DLA.jl, a hardware-generic dense linear algebra library, and Dagger.jl, a dynamic, task-based scheduler. Together, they provide a unified, scalable solution for matrix computations and dynamic workload distribution across heterogeneous architectures.
Contribution. DLA.jl eliminates the need for architecture-specific optimizations while ensuring high performance. By leveraging Julia’s multiple dispatch, type inference, and metaprogramming, it generates optimized machine code that runs efficiently on both CPUs and GPUs. The implementation supports multiple numerical data types such as Float64, Float32, Float16, and complex numbers, making it broadly applicable to scientific and engineering workloads. Benchmarks show DLA.jl performs on par with or better than LAPACK and cuSOLVER, all while being more flexible and extensible.
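To make the dispatch mechanism concrete, the sketch below uses hypothetical names (dla_gemm! is not DLA.jl's documented interface) to show how a single generic entry point can serve Float64, Float32, Float16, and complex matrices, with backends free to add specialized methods for their own array types:

    using LinearAlgebra

    # Hypothetical names, not DLA.jl's documented API: one generic entry point
    # covers every element type; a backend adds specialized methods for its
    # array types, and Julia's dispatch picks the right one at the call site.
    function dla_gemm!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix,
                       α::Number = true, β::Number = false)
        return mul!(C, A, B, α, β)          # generic fallback (CPU or GPU arrays)
    end

    # A GPU backend could specialize, e.g. for CUDA.CuMatrix{Float16}, to reach
    # tensor-core kernels; the caller's code stays the same across hardware:
    for T in (Float64, Float32, Float16, ComplexF32)
        A, B = rand(T, 64, 64), rand(T, 64, 64)
        C = similar(A)
        dla_gemm!(C, A, B)
        @assert C ≈ A * B
    end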
Meanwhile, Dagger.jl provides a dynamic scheduling mechanism that optimizes task execution across heterogeneous hardware. Unlike static workload partitioning, it dynamically adjusts computation assignments based on hardware availability, ensuring efficient utilization of both CPUs and GPUs. This approach improves load balancing, reduces memory overhead, and minimizes idle compute time, resulting in better overall performance. By combining DLA.jl for high-performance linear algebra with Dagger.jl for adaptive workload scheduling, we create a powerful framework for modern HPC challenges.
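A minimal sketch of this task-based model, using Dagger.jl's standard @spawn/fetch interface (the matrices and the two-task graph are purely illustrative):

    using Distributed
    addprocs(4)                       # spawn CPU workers for the scheduler to use
    @everywhere using Dagger

    # Each @spawn creates a task; Dagger tracks the data dependency between t1
    # and t2 and places each task on whatever processor is available.
    A, B = rand(1_000, 1_000), rand(1_000, 1_000)
    t1 = Dagger.@spawn A * B          # scheduled on any available worker
    t2 = Dagger.@spawn sum(t1)        # runs once t1's result is ready
    @show fetch(t2)                   # block until the task graph completes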
Research Impact. The integration of DLA.jl and Dagger.jl has far-reaching implications in scientific computing, HPC, and AI-driven applications. By offering a unified, hardware-agnostic API, DLA.jl enables high-performance computations on heterogeneous systems without requiring extensive platform-specific tuning. This is particularly beneficial for large-scale simulations in climate modeling, computational physics, and structural engineering.
Dagger.jl further enhances parallel execution efficiency by dynamically distributing tasks across CPUs and GPUs, making it crucial for AI and machine learning workloads. Efficient scheduling ensures deep learning models, statistical simulations, and large-scale pipelines run with minimal bottlenecks. Additionally, DLA.jl leverages recursive formulations and custom GPU kernel optimizations, achieving performance comparable to cuBLAS and rocBLAS, particularly for large matrix sizes.
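The sketch below illustrates the recursive-formulation idea for a Cholesky factorization; it is illustrative only, not DLA.jl's implementation: factor the leading block, update the trailing block with BLAS-3 kernels, and recurse until a base-case kernel takes over.

    using LinearAlgebra

    # Illustrative sketch of a recursive formulation (not DLA.jl's actual code).
    # The lower triangle of A is overwritten with its Cholesky factor L.
    function recursive_cholesky!(A::AbstractMatrix{T}, block::Int = 256) where {T}
        n = size(A, 1)
        if n <= block
            cholesky!(Hermitian(A, :L))            # base case: vendor POTRF kernel
            return A
        end
        k = n ÷ 2
        A11 = view(A, 1:k,   1:k)
        A21 = view(A, k+1:n, 1:k)
        A22 = view(A, k+1:n, k+1:n)
        recursive_cholesky!(A11, block)            # L11 overwrites A11
        A21 .= A21 / LowerTriangular(A11)'         # L21 = A21 * inv(L11')  (TRSM)
        mul!(A22, A21, A21', -one(T), one(T))      # A22 -= L21 * L21'      (GEMM/SYRK)
        recursive_cholesky!(A22, block)            # recurse on the trailing block
        return A
    end

    M = randn(1_000, 1_000); S = M * M' + 1000I    # random SPD test matrix
    L = LowerTriangular(recursive_cholesky!(copy(S)))
    @assert L * L' ≈ S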
A key research direction is integrating mixed-precision iterative refinement into DLA.jl, improving computational efficiency while maintaining accuracy. Dagger.jl will be extended to support hybrid scheduling strategies that intelligently partition workloads between CPU and GPU nodes, further optimizing large-scale scientific applications.
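For concreteness, a minimal sketch of classical mixed-precision iterative refinement (not DLA.jl code; the function name is hypothetical): the matrix is factored once in Float32, while residuals and corrections are accumulated in Float64.

    using LinearAlgebra

    # Factor once in low precision, refine in full precision.
    function ir_solve(A::Matrix{Float64}, b::Vector{Float64}; iters::Int = 5)
        F = lu!(Float32.(A))                  # cheap low-precision factorization
        x = Float64.(F \ Float32.(b))         # initial low-precision solve
        for _ in 1:iters
            r = b - A * x                     # residual in full precision
            x += Float64.(F \ Float32.(r))    # correction via the Float32 factors
        end
        return x
    end

    A = rand(500, 500) + 500I; b = rand(500)
    x = ir_solve(A, b)
    @show norm(A * x - b) / norm(b)           # approaches Float64 accuracy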
Relevance to the Julia Community. The development of DLA.jl and Dagger.jl is a major contribution to the Julia ecosystem, especially in scientific computing, parallel programming, and GPU acceleration. By providing a Julia-native implementation, DLA.jl reduces dependency on Fortran/C-based libraries while maintaining full compatibility with Julia’s numerical software stack. This makes developing and optimizing HPC applications more accessible without requiring low-level programming.

I am a postdoctoral researcher at the MIT JuliaLab and an HPC enthusiast who loves solving complex problems by thinking in parallel. My research sits at the intersection of High-Performance Computing (HPC) and Artificial Intelligence (AI), exploring how advanced computational techniques can optimize AI algorithms for greater efficiency and effectiveness. I was honored as one of the Rising Stars in Computational and Data Sciences by the U.S. Department of Energy. My collaborations extend internationally, including with the Innovative Computing Lab at the University of Tennessee and MINES ParisTech. In Summer 2021, I was a visiting scholar at the Innovative Computing Lab, where I contributed to a milestone of the Software for Linear Algebra Targeting Exascale (SLATE) project, a joint initiative of the U.S. Department of Energy’s Office of Science and the National Nuclear Security Administration (NNSA).
