EuroSciPy 2026

Profiling Python GPU Code
2026-07-20, Room 2.41 (First Floor, Turing)

Your GPU is fast, so why does your Python code still feel slow? This talk shows a practical, Python-first profiling workflow with Nsight Systems, Nsight Compute, and NVTX for CuPy, Numba, PyTorch, JAX, and CUDA extensions. We will use timelines to find launch overhead, hidden synchronizations, and host-device copies, then drill into kernel bottlenecks like memory throughput and occupancy. You will leave with a repeatable loop for turning profiles into measurable speedups.


Your GPU is fast, so why does your Python code still feel slow?

When you accelerate Python with CuPy, Numba, PyTorch, JAX, or custom CUDA extensions, performance problems rarely look like a single slow kernel. They look like death by a thousand cuts: tiny launches, hidden synchronizations, accidental host-device copies, stream serialization, and kernels that are "fine" until you look at memory traffic. The good news is that NVIDIA's developer tools can make these issues obvious, if you know what to capture and how to read it.

In this talk, I'll show a practical, Python-first profiling workflow using Nsight Systems, Nsight Compute, and NVTX. We'll start at the top with system-level timelines to answer "where did the time go?", then drill down into kernel-level analysis to answer "why is this kernel slow?" Along the way, you'll learn how to annotate Python code with NVTX so your traces are readable, how to profile from notebooks and CI, and how to turn profiler output into a short, repeatable optimization loop.
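To give a flavor of the annotation step, here is a minimal sketch of NVTX ranges in Python. It assumes the `nvtx` package from PyPI and falls back to a no-op when that package is missing, so it runs anywhere. The functions `load_batch`, `compute`, and `pipeline` are hypothetical stand-ins for real pipeline stages, not code from the talk.

```python
import contextlib

try:
    import nvtx  # assumption: the `nvtx` package from PyPI is installed
    annotate = nvtx.annotate
except ImportError:
    # Fallback so the sketch still runs on machines without NVTX.
    @contextlib.contextmanager
    def annotate(message=None):
        yield


def load_batch(n):
    """Hypothetical stand-in for host-side data preparation."""
    with annotate("load_batch"):
        return [float(i) for i in range(n)]


def compute(batch):
    """Hypothetical stand-in for a GPU step (e.g. a CuPy or PyTorch call)."""
    with annotate("compute"):
        return sum(x * x for x in batch)


def pipeline(steps=3, n=1000):
    results = []
    for step in range(steps):
        # One outer range per iteration groups the nested ranges on the timeline.
        with annotate(f"step {step}"):
            results.append(compute(load_batch(n)))
    return results


if __name__ == "__main__":
    print(pipeline())
```

Run under something like `nsys profile --trace=cuda,nvtx python my_script.py` (the script name is a placeholder), the named ranges should appear as nested bars on the Nsight Systems timeline next to the CUDA activity they enclose.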

Key takeaways:
- How to use NVTX ranges and markers from Python to make timelines explain themselves.
- How to capture the right Nsight Systems trace to spot launch overhead, sync points, copies, and stream issues.
- How to pivot from a timeline hotspot to Nsight Compute and choose metrics that actually answer your question.
- How to interpret common kernel bottlenecks (memory throughput, occupancy limits, instruction mix) without drowning in counters.
- A checklist for avoiding profiling traps (implicit sync, warmup, clock variability, sampling noise, and "profiling changed my code").
- A repeatable workflow you can apply to real Python GPU stacks, from single kernels to end-to-end pipelines.
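Two of the traps in that checklist, warmup and implicit synchronization, can be illustrated with a small timing harness. This is a sketch under stated assumptions rather than the workflow from the talk: `benchmark` and its `sync` hook are hypothetical names, and the default `sync` does nothing so the example runs without a GPU.

```python
import statistics
import time


def benchmark(fn, *, warmup=3, repeats=10, sync=lambda: None):
    """Time fn() with warmup runs and explicit synchronization.

    `sync` is a hypothetical hook: on a GPU stack you would pass something
    like torch.cuda.synchronize or cupy.cuda.Device().synchronize so that
    asynchronous kernel launches have finished before the clock is read.
    """
    for _ in range(warmup):
        fn()  # absorb JIT compilation, caching, and allocator warmup
    sync()

    samples = []
    for _ in range(repeats):
        sync()  # make sure nothing is pending before the timer starts
        start = time.perf_counter()
        fn()
        sync()  # wait for the work itself, not just the launch
        samples.append(time.perf_counter() - start)

    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
        "runs": len(samples),
    }


if __name__ == "__main__":
    stats = benchmark(lambda: sum(i * i for i in range(100_000)))
    print(stats)
```

Reporting the median alongside the minimum and maximum makes run-to-run noise visible. Without the `sync` calls, a timer around asynchronous GPU code mostly measures launch cost rather than kernel time.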

By the end, you'll be able to profile Python GPU code with intent, isolate the bottleneck you actually have, and make changes you can measure and defend.


Expected audience expertise: Domain: expert
Expected audience expertise: Python: some
Project homepage or Git: Project homepage or Git
Your relationship with the presented work/project: Original author or co-author

Bryce Adelstein Lelbach has spent over a decade developing programming languages, compilers, and libraries. He is passionate about parallel programming and strives to make it more accessible for everyone.

Bryce is a Principal Architect at NVIDIA, where he founded the Core C++ Compute Libraries team and now leads the Vanguard Programming group that drives NVIDIA's roadmap for programming languages, compilers, and core libraries.

He is a leader of the systems programming language community, having served as chair of the C++ Library Evolution group and of the US programming language standards committee. He has been an organizer and program chair for many conferences over the years. On the C++ committee, he has worked on concurrency primitives, parallel algorithms, senders, and multidimensional arrays.

He previously worked at Lawrence Berkeley National Laboratory and Louisiana State University. He is one of the founding developers of the HPX parallel runtime system.

Outside of work, Bryce is passionate about airplanes and watches. He lives in Midtown Manhattan with his girlfriend and dog.
