Lessons learned from comparing Numba-CUDA and C-CUDA
2019-09-04, 16:45–17:00, Track 1 (Mitxelena)

We compared the performance of GPU applications written in C-CUDA and Numba-CUDA. By analyzing the generated GPU assembly code, we identified the reasons for the performance differences. This helped us to optimize both our Numba-CUDA code and Numba itself.


Numba allows the development of GPU code in Python style. When a Python script using Numba is executed, the code is compiled just-in-time (JIT) using the LLVM framework. Using Python for GPU programming can mean a considerable simplification in the development of parallel applications compared to C and C-CUDA.
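To illustrate what "GPU code in Python style" can look like, here is a minimal sketch of a SAXPY kernel (out = a * x + y), a common HPC micro-benchmark. It is not taken from the talk: the names `saxpy`, `saxpy_kernel` and `saxpy_reference` are hypothetical, and the block size of 128 threads is an arbitrary assumption. So that the sketch also runs on machines without Numba or a CUDA device, it falls back to a pure-Python reference in that case.

```python
def saxpy_reference(a, x, y):
    """Pure-Python reference: out[i] = a * x[i] + y[i]."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def saxpy(a, x, y):
    """Compute a * x + y on the GPU with Numba-CUDA if possible.

    Falls back to saxpy_reference when NumPy/Numba is not installed
    or no CUDA device is available (an assumption of this sketch).
    """
    try:
        import numpy as np
        from numba import cuda
        if not cuda.is_available():
            raise RuntimeError("no CUDA device available")
    except (ImportError, RuntimeError):
        return saxpy_reference(a, x, y)

    # Defined inside the function only to keep the sketch self-contained;
    # real code would define the kernel once at module level so that it
    # is JIT-compiled a single time.
    @cuda.jit
    def saxpy_kernel(a, x, y, out):
        i = cuda.grid(1)      # global 1D thread index
        if i < out.size:      # guard threads past the end of the array
            out[i] = a * x[i] + y[i]

    # Copy the inputs to the device as single-precision arrays.
    x_d = cuda.to_device(np.asarray(x, dtype=np.float32))
    y_d = cuda.to_device(np.asarray(y, dtype=np.float32))
    out_d = cuda.device_array_like(x_d)

    threads_per_block = 128   # arbitrary choice for this sketch
    blocks = (x_d.size + threads_per_block - 1) // threads_per_block

    # The first launch triggers JIT compilation for these argument types.
    saxpy_kernel[blocks, threads_per_block](np.float32(a), x_d, y_d, out_d)
    return out_d.copy_to_host().tolist()


result = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

The kernel body reads like ordinary Python, while the launch configuration in square brackets plays the role of the `<<<blocks, threads>>>` syntax in C-CUDA.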

Python, however, has a reputation for low performance, especially in High-Performance Computing (HPC).
We wanted to find out whether this reputation is deserved and where the performance differences come from. To do so, we first analyzed the performance of typical micro-benchmarks used in HPC. By analyzing the assembly code, we learned a lot about the differences between the code produced by C-CUDA and Numba-CUDA. Some of these insights helped us to improve the performance of our application, and also of Numba-CUDA itself. With a few tricks, our Numba code achieves very good performance: very close to, and sometimes even better than, the C-CUDA versions.


Domains – Big Data, Parallel computing / HPC
Domain Expertise – expert
Python Skill Level – professional
Project Homepage / Git
Abstract as a tweet – NUMBA-CUDA: How to write efficient CUDA-Code in Python