2025-09-04, Room 1
The NVIDIA Grace Blackwell Superchip is a high-performance, Arm-based server platform designed for datacenter applications. It features a unified, cache-coherent memory subsystem that optimizes CPU-GPU interactions and facilitates efficient resource allocation. An NVLink-based chip-to-chip interconnect enables coherent memory access between the CPU and GPU, providing a unified memory view and allocation control at the OS level. GPU memory poison errors are managed through CPU firmware, while Address Translation Services (ATS) support allows the CPU and GPU to share a virtual address space.
NVIDIA vGPU extends these capabilities to virtualized environments, enabling multi-tenancy and efficient GPU resource sharing across multiple virtual machines (VMs). Leveraging Multi-Instance GPU (MIG), vGPU partitions GPUs into secure instances that can be assigned to VMs independently. Additionally, vSMMU support and PASID ensure process isolation within virtualized environments.
This presentation explores the system architecture of Grace Blackwell, detailing the design and implementation of vGPU to support these new platform-specific features. We will also discuss the status of the ongoing upstreaming efforts.
Ankit is an open-source developer at NVIDIA working on vGPU and passthrough virtualization. He is currently focused on virtualization support for NVIDIA Grace-based systems.