2026-07-20, Room 1.38 (Ground Floor, Turing)
Parallel programming can be intimidating, but doesn’t need to be! Tile-based programming models make GPU parallelism more newcomer-friendly, highly productive, and still fast by letting you write sequential, array-centric code while the framework handles parallelization, synchronization, and data movement.
In this example-driven talk, we’ll introduce tile-based programming in Python using NVIDIA’s new stack: cuTile and its compiler foundation, Tile IR. You’ll see recently announced CUDA Tile capabilities in action, including multi-GPU communication, interoperability with traditional CUDA SIMT, and support for more diverse kernels such as convolutions and stencils. We’ll compare tile and SIMT approaches, build intuition for performance and execution, and demonstrate practical debugging and reasoning techniques. Along the way, you’ll see real workloads: HPC stencils, an SpMV plus CG solver, and ML models from TileGym. You’ll leave with a clear sense of when tile programming helps, and how it enables more portable high-performance Python as hardware trends evolve.
Parallel programming can be intimidating, but doesn't need to be! There's a new paradigm for parallel programming that's newcomer-friendly, highly productive, and performant: tile-based programming models.
Tile programming divides inputs into local arrays that are processed concurrently by groups of threads. Users write sequential array-centric code, and the framework handles parallelization, synchronization, and data movement behind the scenes.
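As a rough illustration of that division of labor, here is a minimal CPU-only sketch in plain NumPy. It is not cuTile's API: the names `tile_kernel`, `launch`, and `TILE` are hypothetical, and the loop over tiles only stands in for the concurrent execution a tile framework would arrange on a GPU.

```python
import numpy as np

TILE = 4  # hypothetical tile size, chosen only for illustration


def tile_kernel(x_tile, y_tile, a):
    # Sequential, array-centric body: it operates on a whole tile at once
    # and contains no thread indices or explicit synchronization.
    return a * x_tile + y_tile


def launch(x, y, a, tile=TILE):
    # Stand-in for the framework: split the input into tiles and run the
    # kernel on each. On a GPU these iterations would run concurrently.
    out = np.empty_like(x)
    num_tiles = (x.size + tile - 1) // tile
    for block in range(num_tiles):
        lo = block * tile
        hi = min(lo + tile, x.size)  # the last tile may be partial
        out[lo:hi] = tile_kernel(x[lo:hi], y[lo:hi], a)
    return out


x = np.arange(10, dtype=np.float64)
y = np.ones(10)
result = launch(x, y, 2.0)
```

The kernel body reads like ordinary array code; everything about how tiles are scheduled lives in `launch`, which is the part a tile programming framework would take over.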
In this example-driven talk, we'll introduce you to tile-based programming in Python. We'll present cuTile, NVIDIA's new tile programming stack, and Tile IR, the new compiler stack it is built on. You'll learn about recently announced CUDA Tile features, including multi-GPU communication, interoperability with traditional CUDA SIMT, and support for more diverse kernels like convolutions and stencils. We'll compare and contrast tile-based models with traditional parallel programming models. You'll see examples from a variety of domains, including HPC stencils, a sparse matrix-vector (SpMV) and conjugate gradient (CG) solver, and AI models from TileGym.
Tile programming aligns well with SciPy's array-centric ethos and has roots in older HPC libraries, such as NWChem’s TCE, BLIS, and ATLAS. In recent years, many tile-based Python programming models for GPUs have emerged, like Triton, JAX/Pallas, and Warp, aiming to make parallelism more accessible for scientists and increase portability.
In this talk, you'll:
- Learn the best practices for writing tile-based Python applications for GPUs.
- Gain insight into the performance of tile GPU code and how it actually gets executed.
- Discover how to reason about and debug tile code in Python applications.
- Understand the differences between tile and SIMT programming and when each paradigm should be used.
- See how tile programming makes your software portable in light of recent hardware trends.
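To make the tile versus SIMT distinction concrete, the sketch below writes the same 3-point averaging stencil in both styles, again emulated on the CPU with NumPy. Nothing here is cuTile or CUDA API: `simt_style`, `tile_style`, and the tile size are hypothetical names and values, and the Python loops merely stand in for a GPU launch.

```python
import numpy as np


def simt_style(u, out, thread_id):
    # SIMT flavor: one logical thread computes one element and must
    # guard against the domain boundaries itself.
    i = thread_id
    if 0 < i < u.size - 1:
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0


def tile_style(u, out, block_id, tile):
    # Tile flavor: one block computes a whole tile with array operations,
    # reading a one-element halo on each side of the tile.
    lo = max(block_id * tile, 1)
    hi = min((block_id + 1) * tile, u.size - 1)
    if lo < hi:
        out[lo:hi] = (u[lo - 1:hi - 1] + u[lo:hi] + u[lo + 1:hi + 1]) / 3.0


u = np.linspace(0.0, 1.0, 11) ** 2
out_simt = u.copy()  # boundary values are left unchanged
out_tile = u.copy()

# Emulated launches: one iteration per thread, then one per tile.
for t in range(u.size):
    simt_style(u, out_simt, t)

tile = 4
for b in range((u.size + tile - 1) // tile):
    tile_style(u, out_tile, b, tile)
```

Both versions produce the same result; the difference is the unit of reasoning, a single element per thread versus a whole array slice per block.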
By the end of the session, you'll understand how tile-based GPU programming enables more intuitive, portable, and efficient development of high-performance, data-parallel Python applications for HPC, data science, and machine learning.
Bryce Adelstein Lelbach has spent over a decade developing programming languages, compilers, and libraries. He is passionate about parallel programming and strives to make it more accessible for everyone.
Bryce is a Principal Architect at NVIDIA, where he founded the Core C++ Compute Libraries team and now leads the Vanguard Programming group that drives NVIDIA's roadmap for programming languages, compilers, and core libraries.
He is a leader of the systems programming language community, having served as chair of the C++ Library Evolution group and of the US programming language standards committee. He has been an organizer and program chair for many conferences over the years. On the C++ committee, he has worked on concurrency primitives, parallel algorithms, senders, and multidimensional arrays.
He previously worked at Lawrence Berkeley National Laboratory and Louisiana State University. He is one of the founding developers of the HPX parallel runtime system.
Outside of work, Bryce is passionate about airplanes and watches. He lives in Midtown Manhattan with his girlfriend and dog.