KVM Forum 2025

09:00
09:00
10min
Keynote (KVM)
Room 1
09:15
09:15
30min
Preserving VFIO PCI Devices During Kernel Live Updates
Vipin Sharma

Typically, updating a host kernel requires live-migrating virtual machines (VMs) to other hosts. However, this approach isn't feasible for VMs that rely on GPUs or for large-scale large language model (LLM) training clusters spread across numerous hosts, where migration is complex and disruptive.

To address this challenge, Google is developing a Live Update mechanism [1]. This feature allows devices assigned to VMs or the Virtual Machine Monitor (VMM) via VFIO (Virtual Function I/O) to remain operational even as the host transitions to a new kernel using Kexec.

VFIO PCI device preservation is the key enabling technology here. It ensures that a PCI device can continue its direct memory access (DMA) and interrupt operations without being reset while the host kernel undergoes a Kexec-based update. Achieving this requires significant modifications to the VFIO, IOMMU (Input/Output Memory Management Unit), and PCI subsystems.

This talk will delve into Google's approach to preserving VFIO PCI devices during live kernel updates and the challenges encountered during its development.

[1] https://lore.kernel.org/lkml/20250515182322.117840-1-pasha.tatashin@soleen.com/

Room 1
09:15
30min
Single-binary: Unify QEMU system binaries per target architecture
Pierrick Bouvier

QEMU has historically been designed around a separate binary for each target. Nowadays, with the advent of heterogeneous systems, this has become a barrier to emulating them. As a first step, we have been reworking the QEMU architecture to allow building multiple targets into the same binary. This is what we call the 'single binary'.

In this presentation, we'll introduce the approach we chose and the challenges we met on the road, from the build system to target code, going through a wide range of QEMU subsystems. Finally, we'll give a status update on this project and outline our next steps.

Room 2
09:45
09:45
30min
Upstreaming NVIDIA vGPU Support: Architecture, Implementation, and Roadmap
Zhi Wang

NVIDIA vGPU technology brings high-performance GPU capabilities to virtualized environments, supporting a wide range of workloads - from graphics-intensive virtual desktops to AI and data science applications. Enabling GPU resource sharing or exclusive assignment on physical GPUs deployed in cloud or enterprise data centers combines the performance benefits of NVIDIA hardware with the flexibility and manageability of virtualization.

Moving upstream, we propose a software architecture based on SR-IOV, where each vGPU is represented by a PCI Virtual Function (VF) managed through the standard Linux VFIO framework. The NVIDIA vGPU VFIO driver, implemented as a VFIO variant driver, exposes standard userspace interfaces and supports critical features such as vGPU type selection, runtime creation and teardown of vGPU instances, and live migration. Underneath, the VFIO driver interacts with NVKM, a core driver responsible for managing the hardware. The architectural goal is for NVKM to serve the DRM driver for host graphics, other NVIDIA GPU use cases, and the VFIO driver for vGPU.

Attendees will gain insight into the design architecture and upstream changes. We will also share our upstream roadmap and areas where community input is most needed.

Room 1
09:45
30min
virtual secure boot in 2025 -- the confidential computing edition
Gerd Hoffmann

Roughly ten years ago, secure boot support for virtual machines made
its debut. It was available for the x86 architecture and the q35
machine type, built on SMM emulation in qemu and the kernel,
essentially following what physical hardware does.

Since then the world has moved forward, raising a number of
challenges for secure boot support.

  • confidential computing - SEV-ES, SEV-SNP and TDX are by design
    incompatible with SMM emulation, because the host has no access to
    guest register state (which is needed to emulate the SMM context
    switch).

  • aarch64 platform - el3 aka secure world emulation (roughly
    comparable to SMM mode) is unlikely to happen anytime soon.

  • riscv64 platform - similar to aarch64 (except it's named supervisor
    mode there).

  • CONFIG_KVM_SMM - kvm support for SMM emulation is now optional;
    proposed by Google at KVM Forum to reduce kvm complexity, this was
    merged in 2022.

This talk will discuss how secure boot can be supported without
depending on SMM emulation, and it will present the work in various
projects (tianocore edk2, qemu, coconut svsm) to make that happen.

Room 2
10:15
10:15
30min
NVIDIA vGPU Support on Grace Blackwell Superchip: Architecture, Design, Upstreaming Status
Ankit Agrawal

The NVIDIA Grace Blackwell Superchip is a high-performance, ARM-based server platform designed for datacenter applications. It features a unified, cache-coherent memory subsystem that optimizes CPU-GPU interactions, facilitating efficient resource allocation. The system enables coherent memory access between the CPU and GPU via an NVLINK-based chip-to-chip interconnect, providing a unified memory view and allocation control at the OS level. GPU memory poison errors are managed through CPU firmware, while Address Translation Services (ATS) support allows a shared virtual address space between CPU and GPU.

NVIDIA vGPU extends these advanced capabilities to virtualized environments, enabling multi-tenancy and efficient GPU resource sharing across multiple virtual machines (VMs). Leveraging Multi-Instance GPU (MIG), vGPU partitions GPUs into secure instances for independent VM assignment. Additionally, vSMMU and PASID support ensure process isolation within virtualized environments.

This presentation explores the system architecture of Grace Blackwell, detailing the design and implementation of vGPU to support these new platform-specific features. We will also discuss the status of the ongoing upstreaming efforts.

Room 1
10:15
30min
The State of QEMU WebAssembly Port
Kohei Tokunaga

QEMU's system emulator has recently merged initial support for Emscripten-based cross-compilation to WebAssembly (Wasm) in its 32-bit TCI mode. Since Wasm is a binary format widely supported by modern browsers, this enhancement enables QEMU to run directly within the browser, opening up new use cases such as web-based playgrounds.

In this talk, Kohei will discuss this feature and its implementation. He'll also share the current status of ongoing discussions, including support for 64-bit guests, a Wasm-based TCG backend, and broader device support.

Room 2
10:45
10:45
30min
Coffee break
Room 1
10:45
30min
Coffee break
Room 2
11:15
11:15
30min
Improving Windows Hypervisor-Protected Code Integrity (HVCI) Performance on KVM
Jon Kohler, Sergey Dyasli

Enabling Windows HVCI on KVM currently poses significant performance challenges due to missing hardware acceleration enablement. This talk will briefly cover the value of HVCI, why Microsoft wants this enabled by default in Windows 11 and Server 2025, and provide details on our proposed KVM improvements to leverage hardware acceleration from both Intel and AMD.

Hardware acceleration support already exists in the form of both Intel Mode Based Execute Control (MBEC) and AMD Guest Mode Execute Trap (GMET). Exposing these processor capabilities requires targeted modifications to the KVM MMU and vendor CPU feature enablement code. In addition to implementation details, we'll provide detailed performance benchmarks of the current state and the observed performance improvements.

Room 1
11:15
30min
The next generation QEMU functional testing framework
Thomas Huth, Daniel Berrange

In the course of the past year, the functional tests of the QEMU project have been completely rewritten: instead of using the Avocado test runner and its libraries, the tests have been adapted to the meson test runner, with newly implemented, more lightweight library functions. This talk will show why this huge effort was made, and discuss the hurdles and design decisions we took on the way to the final goal.

Room 2
11:45
11:45
30min
Hybrid KVM/Hyper-V guest
Mickaël Salaün

Virtual Secure Mode (VSM) is a Hyper-V mechanism to enforce restrictions on a VM (VTL0) thanks to a dedicated sidecar VM (VTL1). This enables guest kernels to drop privileges and limit attackers' ability to gain full kernel privileges.

KVM is gaining VSM support with the Hyper-V emulation layer. We're working on creating a hybrid KVM guest that could use some Hyper-V hypercalls, especially those related to VSM. We'd like to talk about our approach to creating this hybrid guest.

Room 1
11:45
30min
Making io_uring pervasive in QEMU
Stefan Hajnoczi

In 2019 Linux introduced io_uring as an asynchronous I/O interface that minimizes system call overhead. Since then io_uring has expanded beyond file I/O to become a general-purpose asynchronous system call interface. This presentation discusses recent changes and the next steps for QEMU's io_uring support.

As more Linux kernel features are exposed through io_uring, QEMU components will increasingly need to call it. This led to the development of the new QEMU aio_add_sqe() API that allows custom io_uring operations to be submitted and integrates with QEMU's event loop.

Making io_uring accessible in the event loop also led to enabling io_uring-based file descriptor monitoring in QEMU's event loop. Instead of using ppoll(2) or epoll(7) to wait for events, io_uring can drive the whole event loop.

Come find out about the challenges and performance of these changes, as well as use cases for io_uring in QEMU. This talk is for developers interested in using io_uring themselves in QEMU, as well as anyone who wants to learn more generally about how applications can take advantage of io_uring.
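
To make the fd-monitoring idea concrete, here is a minimal liburing sketch (generic POSIX code, not QEMU's actual aio_add_sqe() integration, whose exact shape is out of scope here): readiness requests are submitted as poll SQEs, and one blocking wait on the completion queue replaces ppoll(2)/epoll(7).

```c
/* Minimal liburing sketch of fd monitoring without ppoll(2)/epoll(7).
 * Generic illustration; QEMU's aio_add_sqe() integration differs. */
#include <liburing.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>

static void watch_fd(struct io_uring *ring, int fd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_poll_add(sqe, fd, POLLIN);       /* one-shot readiness */
    io_uring_sqe_set_data(sqe, (void *)(intptr_t)fd);
}

int event_loop(int *fds, int nfds)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;

    if (io_uring_queue_init(64, &ring, 0) < 0)
        return -1;
    for (int i = 0; i < nfds; i++)
        watch_fd(&ring, fds[i]);
    io_uring_submit(&ring);

    for (;;) {
        if (io_uring_wait_cqe(&ring, &cqe) < 0)    /* single blocking call */
            break;
        int fd = (int)(intptr_t)io_uring_cqe_get_data(cqe);
        printf("fd %d ready\n", fd);               /* dispatch handler here */
        io_uring_cqe_seen(&ring, cqe);
        watch_fd(&ring, fd);                       /* re-arm one-shot poll */
        io_uring_submit(&ring);
    }
    io_uring_queue_exit(&ring);
    return 0;
}
```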

Room 2
12:15
12:15
90min
Lunch
Room 1
12:15
90min
Lunch
Room 2
13:45
13:45
30min
NeVer again: the last KVM/arm64 rewrite?
Marc Zyngier

Nested Virtualisation (NV) support for KVM/arm64 is expected to go live in Linux v6.16, should everything work according to plan.

Although an initial patch series had been maintained out of tree since 2017, its level of complexity was too high (and admittedly quality too low) to be seriously considered a merge candidate.

It took some effort to significantly refactor KVM/arm64 to a point where the NV support would be maintainable by drastically reducing its complexity, while ensuring the changes would benefit non-NV setups. It also took time for the architecture to reach a point where supporting NV in KVM was actually worth the effort.

This isn't the first time KVM/arm64 has undergone a major redesign. But this instance radically changes the way new architectural features are introduced to the hypervisor. This has been achieved in part by using the ARM Architecture Machine Readable Specification (AARCHMRS), which was recently released under a permissive license. This allowed the modelling of a sizeable chunk of architectural behaviour. Not only does this ensure compliance with the specification, it also helps find issues with it.

This talk will describe why such a formalism was needed, how it has been put to good use, what other challenges were tackled to get to this point, and what remains to be done.

Room 1
13:45
30min
rust-vmm: updates, adoption, and future directions
Stefano Garzarella, Patrick Roy, Ruoqing He

It has been several years since the last rust-vmm update at KVM Forum, but the community has continued to grow. Our goal remains the same: to provide reusable Rust crates that make it easier and faster to build virtualization solutions.

This talk will present the main progress and achievements from the past few years. It reviews how rust-vmm crates integrate into projects such as Firecracker, Cloud Hypervisor, libkrun, and virtiofsd. We will cover recent work supporting new architectures like RISC-V and additional operating systems. The talk will also discuss plans to consolidate all crates into a single monorepo to simplify development and releases. Finally, we will review the support for virtio and vhost-user devices that can be used by any VMM.

Room 2
14:15
14:15
30min
IOMMU in rust-vmm, and new FUSE+VDUSE use cases
Hanna Czenczek, Eugenio Pérez

We’ll give an overview of the IOMMU model in vhost-user and efforts to integrate support into the rust-vmm ecosystem. Doing so requires changes to the memory model and implementing the vhost-user protocol part, so it is an effort across various crates in the ecosystem, from vm-memory up to vhost-user-backend.

Presenting these changes and why they’re necessary will also give general insight into how all of these crates even work together in the first place, which we hope will serve as a good introduction to the ecosystem.

Adding IOMMU capabilities to these crates also enables interesting use cases, especially related to VDUSE exposed through vhost vDPA and virtio vDPA. This allows exposing vhost-user devices to containers through a vhost-user-to-VDUSE bridge.

Talking about combining virtiofs and VDUSE, there is another interesting combination: exposing FUSE filesystems through VDUSE. Again, this allows the existing (and varied!) ecosystem of FUSE apps to be exposed to containers and VMs, without the need to modify the FUSE app, the guest, or the containerized app.

Room 2
14:15
30min
Parallel vCPU onlining for arm64
Will Deacon

CONFIG_HOTPLUG_PARALLEL was introduced to the kernel to enable parallel booting of CPUs, primarily to accelerate the application of microcode updates on x86. However, much of the logic driving the onlining is implemented in core code and so this talk will cover the grotty details of enabling it for arm64 and reveal whether or not it can accelerate the onlining of vCPUs under KVM.

Room 1
14:45
14:45
30min
Optimizing vPMU on ARM
Colton Lewis

KVM's current vPMU implementation on ARM traps and emulates the PMU in its entirety. This is a significant source of overhead for any use of performance monitoring capabilities inside a guest.

This talk will explain my work over the past several months to improve the matter [1]. Relying on modern ARM CPU features such as PMUv3 and FGT (fine-grained traps), it becomes possible to selectively untrap the most common PMU registers and features, allowing guests direct hardware access that cuts the overhead and significantly improves performance. A more detailed explanation with some notable performance improvements can be found in my cover letter on the kvmarm mailing list.

[1] https://lore.kernel.org/kvmarm/20250602192702.2125115-1-coltonlewis@google.com/

Room 1
14:45
30min
Virtio 2025 state of the union
Michael S. Tsirkin

A lot has happened in virtio land in the last year - new faces, new devices, new drivers, new functionality.
There's new work on testing, and a lot more!
This will give an overview of where we are and what to expect in 2026 and beyond.

Room 2
15:15
15:15
30min
Rust firmware for EFI direct kernel boot on mach-virt/arm64
Ard Biesheuvel

Superfast boot is important for micro-VMs, and this is usually accomplished by booting the kernel directly from the VMM, rather than going through the usual firmware and bootloader. EFI is typically avoided in these cases, as it has a reputation for being slow and buggy on x86.

On arm64, the situation is a bit different: without firmware, the kernel is entered with MMU and caches disabled, which poses its own set of problems. And without EFI, accessing ACPI and SMBIOS tables is problematic as well.

This talk describes an alternative proposal for doing direct kernel boot on arm64 virtual machines: a minimal re-implementation of EFI in Rust, tightly coupled with QEMU, to boot the guest kernel in EFI mode with all caching and memory protections enabled from reset. I will explain why it is faster and more secure, and results in less maintenance overhead than the non-firmware case.

Room 1
15:15
30min
Towards new migration protocol with unified channels
Prasad Pandit

QEMU live migration moves a running virtual machine from one host to another. While the basic concept of live migration is fairly simple, there is a lot of complexity in the current implementation, which has evolved over many years with different features added at different times to serve specific migration needs, while migration as a whole lost its coherence as one unit. Consequently, we now have limitations: TCP connections (aka channels) are uni-directional, they come up and shut down asynchronously while migration is running, multifd migrates only RAM state, Postcopy cannot use multifd channels, etc.

To make it all work in practice, additional coordination is required between QEMU and a management layer like libvirtd(8). Features available in QEMU (e.g. postcopy-preempt) may not be usable from the virsh(1)/libvirtd(8) side, because these tools need to be taught to handle the new features.

In this session, we'll look at these implementation details and discuss possible way(s) to improve things through a robust migration protocol which could accommodate all of the current requirements and allow for future enhancements, while keeping the overall architecture simple and intuitive.

Room 2
15:45
15:45
30min
Coffee break
Room 1
15:45
30min
Coffee break
Room 2
16:15
16:15
30min
BoF sessions
Room 1
16:15
30min
BoF sessions
Room 2
09:00
09:00
10min
Keynote (QEMU)
Room 1
09:15
09:15
30min
Automatic Frontend Generation for RISC-V Extensions
Anton Johansson

QEMU is an extremely useful tool during testing and development of new architectures, yet adding support for new targets is error-prone and incurs a significant entry cost in learning QEMU internals, especially when keeping up with an evolving ISA specification.

We present our methodology for rapidly implementing and testing Qualcomm's qc_iu set of RISC-V extensions in the absence of a compiler toolchain. As a first step, C++ code and later LLVM IR was produced from instruction definitions provided by riscv-unified-db. Secondly, the LLVM-based helper-to-tcg tool was used to generate TCG implementations for 143 of 172 instructions. Usage of helper-to-tcg enables an emulator-in-the-loop process of designing instruction set extensions, good for rapid prototyping, validation and design space exploration.

Automatic generation of per-instruction tests covering memory operations, branches, and corner cases was accomplished with the LLVM-IR-based symbolic execution engine KLEE. All in all, 289 tests were generated covering 143 instructions, for each version of the ISA specification. This proved incredibly useful in finding bugs in the original instruction definitions.

This is a follow-up to our 2023 KVM Forum talk, where we successfully applied helper-to-tcg to the Hexagon frontend. Since then, the tool has evolved significantly, allowing it to be applied in more general settings.

Room 1
09:15
30min
guest_memfd: Unmapped Potential
Fuad Tabba, Ackerley Tng

The guest_memfd interface was introduced to support hardware-based confidential computing by creating guest memory that is neither mappable by the host nor accessible to host userspace, offering protection against a compromised or buggy host. While effective for its initial purpose, this strict isolation prevents its use for a broader set of virtualization use cases and limits adoption by non-confidential guests. It also lacks the ability to convert memory between private and shared states in place, which introduces unnecessary work when used to provide memory for software-based confidential computing solutions like pKVM [1]. Furthermore, this design makes adding huge page support difficult without incurring significant memory overhead [2].

This presentation will cover new developments, expected to be merged upstream before the conference [3], that extend the capabilities of guest_memfd and move it from a specialized feature toward a universal API for KVM guest memory. The core of this effort involves carefully allowing guest_memfd-backed memory to be mapped in the host under specific, controlled conditions, which unlocks several new capabilities. We will present the mechanism that enables guest_memfd to back standard, non-confidential VMs, which allows additional hardening against potential host-side transient execution attacks.

Building on this foundation, we will give an overview of the ongoing development to support in-place conversion between private and shared pages within a single guest_memfd region [4]. This is a key requirement for software-based confidential computing solutions and also serves as the enabling technology for efficient huge page support. The talk will explain how these extensions work together to make guest_memfd a more flexible and powerful tool for managing guest memory, paving the way for it to become the primary memory backing interface for all guests in KVM.

[1] https://lpc.events/event/18/contributions/1758/
[2] https://lpc.events/event/18/contributions/1764/
[3] https://lore.kernel.org/all/20250605153800.557144-1-tabba@google.com/
[4] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/
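
For background, here is a minimal sketch of the guest_memfd UAPI that shipped in Linux 6.8 (error handling omitted); the mappable shared mode and in-place conversion discussed above are newer extensions built on top of this:

```c
/* Minimal sketch: create a guest_memfd and bind it to a KVM memslot.
 * Linux 6.8+ UAPI; error handling omitted. The shared/mappable mode
 * and in-place conversion discussed above build on top of this. */
#include <linux/kvm.h>
#include <sys/ioctl.h>

int bind_guest_memfd(int vm_fd, __u64 gpa, __u64 size)
{
    struct kvm_create_guest_memfd gmem = { .size = size };
    int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

    struct kvm_userspace_memory_region2 region = {
        .slot = 0,
        .flags = KVM_MEM_GUEST_MEMFD,       /* private memory from gmem_fd */
        .guest_phys_addr = gpa,
        .memory_size = size,
        /* .userspace_addr would supply the host mapping for shared pages */
        .guest_memfd = gmem_fd,
        .guest_memfd_offset = 0,
    };
    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}
```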

Room 2
09:45
09:45
30min
Lorelei: Enable QEMU to leverage native shared libraries
Ziyang Zhang

We extend QEMU's translation flow so that when control flow reaches a call to a dynamic library function in the executable, the call can be redirected to a native dynamic library of the same version and correctly return after executing the expected procedure. Based on our test data from several common applications, QEMU emulation can run several or even tens of times faster with this hybrid-execution scheme, and it exhibits FPS improvements in GUI applications.

Room 1
09:45
30min
guest_memfd for Non-Confidential VMs and Spectre Protection
Patrick Roy

guest_memfd, introduced in Linux 6.8, receives a lot of attention in the context of confidential computing, with KVM support for Intel TDX, AMD SNP, ARM CCA and pKVM being built on top of it, where guest_memfd manages the VM’s encrypted/private memory. However, its design as “guest-first” memory also makes it attractive for traditional, non-confidential VMs that wish to enjoy additional hardening against Spectre-style transient execution issues.

In this talk, we cover how guest_memfd with support for shared memory [1] can be used to run non-confidential VMs solely backed by guest_memfd. We further explore how this mode can be extended by removing direct map entries for guest_memfd folios [2], protecting guest memory from ~60% of Spectre-like transient execution issues, and how we plan to utilize this functionality in the Firecracker VMM.

Room 2
10:15
10:15
30min
QEMU Time Control Redefined: What’s the Time, Mr. Wolf?
Mahmoud Kamel, Alwalid Salama, Mark Burton

Historically, QEMU has supported two distinct methods for timekeeping: traditional wall-clock time and instruction counting via the icount mode. While icount enables deterministic simulation by advancing time based on the number of instructions executed, it comes with notable limitations. Chief among them is the loss of multithreaded execution — icount disables MTTCG, forcing QEMU to run all CPUs on a single thread. This drastically reduces simulation speed and introduces ambiguity when interpreting instruction counts across multiple CPUs.

The fundamental problem is this: icount provides a raw instruction count across all CPUs, not on a per-CPU basis. Until now, there’s been no way to derive meaningful time metrics from icount in a multi-core context (whether multithreaded or not).

A New Approach: TCG Plugin API to the Rescue
Enter the new TCG plugin API. While QEMU ships with a basic example of instruction-based timing using this API, it oversimplifies the problem. This talk introduces a more advanced, practical approach that uses the TCG plugin API to redefine QEMU's time model for better realism and scalability.

Key Mechanism: Independent Per-CPU Time and Global Time Coordination
The proposed mechanism leverages two key features of the TCG plugin API:
- Scoreboards: To track execution progress across CPUs
- Timeouts: To trigger plugin callbacks after a CPU executes a certain number of instructions

Each virtual CPU (vCPU) maintains its own local clock, which increments based on:
- A configured instruction rate (insn_per_second)
- The number of instructions it executes (quantum_insn)

Meanwhile, the global QEMU time is coordinated through a concept called the active token. The vCPU holding the active token is responsible for advancing global time. As vCPUs hit their instruction quantum (end_of_quantum) or go idle, they update their local clocks. If the token-holding vCPU goes idle, the plugin designates the next most active vCPU to take over time progression.

Advantages of This Model
- Realistic Instruction-Based Timing: Time progresses according to the activity of the most active vCPU, not a summed instruction count.
- Multithreaded Support: Each vCPU can be treated independently, maintaining MTTCG compatibility.
- Idle-Aware Timekeeping: Idle vCPUs are excluded from time advancement. If all CPUs go idle, the system smoothly reverts to wall-clock time.
- Modular and Extendable: Implemented as a plugin (icount_plugin), this mechanism cleanly integrates into QEMU without core architectural changes.
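
To make this concrete, here is a stripped-down sketch of such a plugin, loosely modeled on QEMU's contrib/plugins/ips.c and using the upstream plugin API (scoreboards plus the time-control handle); the active-token arbitration described above is elided, and the naive nanosecond scaling is illustrative only.

```c
/* Sketch of a per-vCPU virtual clock plugin, loosely modeled on QEMU's
 * contrib/plugins/ips.c. Scoreboards count instructions per vCPU; the
 * time-control handle advances the global QEMU clock. The active-token
 * arbitration described above is elided. */
#include <qemu-plugin.h>

QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

static qemu_plugin_u64 insn_count;             /* per-vCPU local clock */
static const void *time_handle;
static uint64_t insn_per_second = 1000000000;  /* configured rate */

static void tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
{
    /* Inline-add the TB's instruction count to this vCPU's counter. */
    qemu_plugin_register_vcpu_tb_exec_inline_per_vcpu(
        tb, QEMU_PLUGIN_INLINE_ADD_U64, insn_count,
        qemu_plugin_tb_n_insns(tb));
}

/* The token-holding vCPU would call this from its end-of-quantum or
 * idle callback (registration not shown) to advance global time. */
static void advance_clock(unsigned int vcpu_index)
{
    uint64_t insns = qemu_plugin_u64_get(insn_count, vcpu_index);
    qemu_plugin_update_ns(time_handle,
                          insns * 1000000000 / insn_per_second);
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                           const qemu_info_t *info,
                                           int argc, char **argv)
{
    insn_count = qemu_plugin_scoreboard_u64(
        qemu_plugin_scoreboard_new(sizeof(uint64_t)));
    time_handle = qemu_plugin_request_time_control();
    qemu_plugin_register_vcpu_tb_trans_cb(id, tb_trans);
    (void)advance_clock;   /* hooked up in the full plugin */
    return 0;
}
```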

Room 1
10:15
30min
RISC-V pKVM
Radim Krčmář

The RISC-V pKVM (Protected KVM) draws its name and core design ideas from the Arm pKVM, enabling confidential virtual machines by leveraging "existing" RISC-V hypervisor extensions.

The talk first describes how the initialization process deprivileges Linux into a virtual machine, ensuring that pKVM is executing exclusively in the hypervisor mode. With the untrusted part of the system securely isolated, the discussion shifts to the binary interfaces that cross architectural boundaries to enable confidential virtual machines: userspace ABI, guest SBI, and hypervisor SBI.

The hypervisor SBI is reframed as an internal kernel API, giving it flexibility without the burden of compatibility. Another important reason to develop pKVM was the potential for code reuse with both the RISC-V KVM and other Protected KVM solutions -- the talk explores the extent to which the potential has been fulfilled, and why pKVM is not written in Rust (yet?).

And for those who enjoy waiting for RISC-V, the talk will tease a different KVM-based solution utilizing upcoming ISA extensions.

Room 2
10:45
10:45
30min
Coffee break
Room 1
10:45
30min
Coffee break
Room 2
11:15
11:15
30min
From C to a Rust interface, brick by brick
Zhao Liu, Paolo Bonzini

QEMU's Rust adventure began with direct use of the C-style interfaces generated by bindgen: this first prototype, merged not long after last year's KVM Forum, focused on build system integration and set the stage for a long journey of creating safe Rust interfaces for QEMU. In this talk we will explain the process of distilling the invariants that were required and promised by QEMU's C code, and how we mapped them to concepts such as interior mutability and smart pointers. We will present the path followed over the past year, how we converted QEMU's HPET device model to readable Rust code, and how various ideas and components from the Rust ecosystem help bridge the Rust and C codebases.

Room 1
11:15
30min
Libkrun Meets ARM Confidential Computing Architecture — No Hardware Required (for Now ;))
Matias Vara Larsen

Libkrun is a lightweight virtual machine monitor written in Rust, used in contexts like Podman to securely run workloads in micro-VMs. In this talk, we present our ongoing work to bring support for ARM's Confidential Computing Architecture (CCA) to libkrun. Confidential computing enables strong isolation between the guest and the host by encrypting memory and CPU state, preventing the host from inspecting or modifying sensitive data. CCA, along with AMD SEV-SNP and Intel TDX, extends this model to the ARM world. Memory is encrypted, access violations trigger exceptions, and attestation mechanisms let guests verify they are running in a trusted environment. To develop this support, we’ve built on top of ARM’s FVP simulator, which allows us to test and iterate rapidly. While guest-side support for CCA is already upstreamed, kernel support (KVM) is still under review. We’ll walk through the design, the integration with virtee/cca, and demonstrate how libkrun can already launch a confidential ARM guest. Finally, we’ll cover what’s left — particularly attestation — and where we go from here.

Room 2
11:45
11:45
30min
Physical memory allocation constraints for Confidential Computing guests
Quentin Perret

Running confidential computing (CoCo) payloads on arm64 mobile platforms presents unique challenges due to a wide spectrum of hardware constraints and vastly different power/performance characteristics. Some devices feature non-translating Stage-2 IOMMUs or IOMMUs with reduced addressing capabilities, while others have constraints stemming from their TrustZone implementation. Furthermore, many are very sensitive to Stage-2 page-table fragmentation, whether on the CPU side, DMA side, or both. The emergence of CoCo in the mobile space also brings new use-cases with demanding power and performance requirements.

In this talk, we will first detail these specific problems, explaining how mobile hardware nuances impact the deployment of confidential computing. Secondly, we will formulate a proposal on how to approach these challenges. A core part of the proposal involves physical memory allocation constraints on the memory backing CoCo guests as well as hypervisor data structures. We believe many of these issues can be significantly mitigated through this approach. This session will initiate a discussion on the best way to express these allocation constraints, ideally by extending existing infrastructure such as guest_memfd and dmabuf.

Room 2
11:45
30min
Rust in QEMU: strengths and challenges
Manos Pitsidianakis

QEMU 9.2.0 was released in December 2024 with experimental Rust support. By now the strengths and challenges of writing QEMU code in Rust have become more apparent. This talk will summarize the whys and hows of Rust for QEMU development, tailored for people who are interested in getting involved.

We will briefly go over the following topics:

  • We know Rust has memory safety. How much memory safety is achievable in QEMU and why (not)?

Machines and devices need to interact with internal APIs and Rust is no exception. This means we cannot make fundamental assumptions that make our lives easier, such as "the code that calls my function obeys the same static analysis my code does". The borrow checker is not immediately valuable when C code does not utilise it.

We can create abstractions that leverage safe code: for example, we can ensure exclusive access with locks/interior mutability or by simply trusting API contracts. We will demystify the unsafe keyword and understand how it helps enable safe code.

  • What are the steps for onboarding a device/subsystem to Rust?

We will show how to declare code to Meson, how to use external dependencies, how to generate the required C bindings with bindgen and, most importantly, how to write safe wrappers for them in the qemu-api crate.

  • Rust language and ecosystem idioms/practices/abstractions we can use.

Rust can heavily utilize boilerplate code generation with procedural and declarative macros: we can use them to safely bridge our code with QEMU APIs, such as qdev.

Rust can statically check that invariants hold: we will see how to do that with exhaustive pattern matching, strong typing, dead-code lints, state machines, and the builder pattern.

  • Potential future work ideas (such as QEMU internals and not just device models, async + executors, custom QEMU-specific lints, etc)

If time allows, we will also show a simple "Hello world" device implementation.

Room 1
12:15
12:15
90min
Lunch
Room 1
12:15
90min
Lunch
Room 2
13:45
13:45
30min
Exploring VM placement strategies for chiplet architectures
Shaju Abraham, Het Gala, Shivam Kumar, Soham Ghosh, Gulshan Gabel

Modern processors are increasingly adopting chiplet-based architectures that distribute CPU cores across multiple chiplets, each containing one or more core complexes (CCX) with a shared last-level cache. Inter-chiplet communication penalties can significantly degrade workload performance. While the current Linux scheduler has NUMA awareness and NUCA (non-uniform cache access) awareness through its scheduling domain hierarchy, it lacks adequate consideration for the significantly higher inter-CCX communication penalties inherent in chiplet architectures. This leads to suboptimal VM placement and therefore, degraded performance.

This talk proposes an enhanced VM scheduling framework designed for chiplet-based processors.
1. The framework utilizes lightweight monitoring techniques, such as hardware performance counters, to monitor cache efficiency, memory access patterns and inter-chiplet communication metrics. These insights help formulate informed policies for VM placement and vCPU group migrations.
2. The framework implements intelligent VM vCPU group placement strategies that optimize the initial allocation of vCPU groups and their associated contexts, such as vhost and other datapath threads, across chiplet boundaries. The algorithm balances maximizing chiplet locality against minimizing intra-cache contention. We also study the dynamic behaviour of VM placements in overloaded cases.
3. These performance measurements guide the assignment of optimal virtual topologies to guests, improving performance through chiplet-locality-aware decisions starting at the guest scheduler level.
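
As a flavour of the lightweight monitoring in point 1, the sketch below samples last-level cache misses for a single vCPU thread via perf_event_open(2); the event choice and sampling window are illustrative placeholders, not the framework's actual policy.

```c
/* Sketch: sample LLC misses for one vCPU thread with perf_event_open(2).
 * Illustrative only; event choice and window are placeholders. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

uint64_t llc_misses(pid_t vcpu_tid, unsigned int window_us)
{
    struct perf_event_attr attr = {
        .type = PERF_TYPE_HARDWARE,
        .size = sizeof(attr),
        .config = PERF_COUNT_HW_CACHE_MISSES,   /* last-level cache misses */
        .disabled = 1,
    };
    int fd = perf_event_open(&attr, vcpu_tid, -1, -1, 0);
    uint64_t count = 0;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    usleep(window_us);                          /* sampling window */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(count));
    close(fd);
    return count;
}
```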

Room 1
13:45
30min
Towards Reliable Timekeeping in COCONUT-SVSM
Vaishali Thakkar

COCONUT-SVSM currently lacks a reliable, monotonic timer source, which is essential for supporting trusted services. In SEV-SNP guests, the TSC is the only trusted time source, but its actual frequency may slightly differ from the nominal P0 frequency due to spread spectrum clocking. This small deviation can lead to clock drift over time. This talk explores adding SecureTSC support to COCONUT-SVSM to establish a safe timer foundation, and proposes integrating the KVM clock to improve accuracy -- inviting discussion on how best to ensure reliable timekeeping in SVSM.

Room 2
14:15
14:15
30min
COCONUT SVSM: From Persistent State to New Trusted Services
Oliver Steffen, Stefano Garzarella

Following last year’s presentation on persistent state in COCONUT SVSM, a platform for delivering secure and trusted services to Confidential Virtual Machines (CVMs), this talk will highlight the progress made in implementing key services such as a stateful vTPM and a UEFI variable store. We’ll also discuss upcoming features under consideration, including a secure console, log buffering, enhanced debugging capabilities, and support for live migration. If you’re interested in these features or have ideas for additional services, we invite you to join the discussion.

Room 2
14:15
30min
GiantVM: A Many-to-one Virtualization System Built Atop the QEMU/KVM Hypervisor
Xiong Tianlei, stx

We propose GiantVM, a many-to-one virtualization framework built atop QEMU/KVM. GiantVM consolidates multiple physical servers into a unified virtual machine. We extend QEMU/KVM to enable inter-machine forwarding of I/O and interrupts, allowing multiple physical machines to communicate over the network and aggregate CPU and I/O resources. In addition, we implement a distributed shared memory protocol by leveraging the EPT to support memory synchronization across machines.

Our implementation is based on QEMU 9.0 and the Linux 6.6 LTS kernel, and is capable of successfully running operating systems such as Ubuntu. We are currently exploring further enhancements leveraging the emerging Compute Express Link (CXL) interconnect.

Room 1
14:45
14:45
30min
Attesting Confidential Devices and Provisioning Secure Workload Identities with Trustee
Tobin Feldman-Fitzthum

Trustee is an attestation and resource management service for confidential guests. This talk will cover a year of Trustee development and highlight the features that are on the horizon. The two most significant areas of development and discussion are attesting CVMs with confidential devices attached to them and provisioning identities to confidential guests. While these topics have been a stumbling block in the past, we have made big steps forward. For confidential devices, the first iteration of Trustee support allows us to attest confidential VMs that have devices like the NVIDIA H100 attached via cold-plug. This talk will describe how this is implemented and show the plan for generalizing this to TDISP devices.

The second area, confidential identity, is one of the most subtle parts of confidential computing. This talk will clarify why it is so difficult to reason about the identity of a confidential guest and show how we are finally adding an identity system to Trustee.

Room 2
14:45
30min
Shadow ioeventfd: Accelerating MMIO in vfio-user with Kernel-Assisted Dispatch
Thanos Makatos, John Levon

Efficient handling of MMIO and doorbell updates is essential for achieving high performance in virtualized I/O. The proposed Linux ioregionfd interface reduces context switches and overhead by enabling direct, file-descriptor-based dispatch of MMIO operations, bypassing the traditional need to exit to userspace (typically QEMU, via the KVM_RUN loop). In this talk, we present a different approach inspired by this idea, tentatively called shadow ioeventfd, implemented in the vfio-user protocol. Shadow ioeventfd introduces a shared memory region, separate from the guest-visible BAR, allowing guest writes to be handled entirely within the kernel using eventfd signaling. We discuss how we implemented this in libvfio-user with minimal changes to QEMU and the kernel, and how it integrates with SPDK-based NVMe emulation. We also share performance results demonstrating significant improvements in latency and CPU utilization, up to 200%, compared to traditional userspace emulation; this is especially important for Windows guests, which lack shadow doorbell support.
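
For background, the sketch below shows the existing building blocks the approach combines: a KVM ioeventfd turns a guest doorbell write into an eventfd signal with no exit to userspace, and the backend then reads the latest doorbell value from a shared shadow page. The shadow-page kernel plumbing itself is the new proposal and is only mocked here.

```c
/* Building blocks behind shadow ioeventfd (illustrative sketch only).
 * A KVM ioeventfd makes a guest doorbell write signal an eventfd with
 * no exit to userspace; the shared "shadow" page, which is the new
 * proposal and merely mocked here, lets the backend recover the value. */
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

int register_doorbell(int vm_fd, uint64_t doorbell_gpa)
{
    int efd = eventfd(0, EFD_NONBLOCK);
    struct kvm_ioeventfd ioev = {
        .addr = doorbell_gpa,   /* guest-visible doorbell in the BAR */
        .len  = 4,
        .fd   = efd,
    };
    ioctl(vm_fd, KVM_IOEVENTFD, &ioev);  /* doorbell writes now signal efd */
    return efd;
}

void drain_doorbell(int efd, volatile uint32_t *shadow)
{
    uint64_t n;
    while (read(efd, &n, sizeof(n)) == sizeof(n)) {
        /* A plain ioeventfd discards the written value; the shadow page
         * lets the backend see the most recent doorbell (e.g. SQ tail). */
        uint32_t tail = *shadow;
        (void)tail;   /* ...kick the NVMe emulation with the new tail... */
    }
}
```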

Room 1
15:15
15:15
30min
Coffee break
Room 1
15:15
30min
Coffee break
Room 2
15:45
15:45
30min
Arm and QEMU cpu models - where are we right now?
Cornelia Huck, Sebastian Ott

We previously talked about Arm cpu models at KVM Forum 2023, so now is a good time to summarize the progress we have made so far, where thought is still needed, and how we can continue.

We will demonstrate examples of what is already working (and what is not) with the code available as of today, where the main gaps and points of contention are, and what could be possible directions: how the QEMU command line should be modeled, how the needs of management software such as libvirt could be met, and which combinations of systems are actually reasonable to focus on.

Join this talk to hear about guests being moved between different machines, fun with debug registers and other hard to virtualize registers, and how we can try to make Arm not completely different from x86.

Room 1
15:45
30min
Shared device assignment: the groundwork of direct I/O in confidential VMs
Chenyi Qiang

Shared device assignment, also known as bounce-buffer device assignment, refers to the capability of assigning a hardware PCI device to a confidential VM such that the device can issue DMA to shared/unprotected memory. This can improve I/O performance of confidential VMs, offering benefits similar to those of normal VMs.

In addition to serving as a transitional solution before Trusted Execution Environment (TEE) I/O, which allows the device to issue DMA to private memory, shared device assignment lays the groundwork for a comprehensive TEE I/O implementation. For instance, some TEE I/O technologies (like TDX Connect) rely on the ability to manage devices using shared memory during initialization and in error recovery scenarios.

In this session, we will introduce the basic support for shared device assignment. Additionally, we will clarify future expansion directions, starting with the relationship to some ongoing projects. This includes handling partial unmap situations through support for cut mapping in IOMMUFD, and changes to the conversion path brought about by the new guest_memfd in-place conversion work. Furthermore, the RamDiscardManager framework used in the basic QEMU implementation lacks scalability; in the future, to support more functionality and state management (like virtio-mem or live migration in confidential VMs), a new framework will be necessary.

Room 2
16:15
16:15
30min
Supporting SEV firmware hotload in KVM
Ashish Kalra, David Kaplan

SEV firmware can be updated dynamically while SNP guests are running, which cloud providers want in order to provide better service to their customers when performing security or functionality updates. This talk focuses on the changes needed within Linux/KVM to provide this support, including the ability to update or roll back firmware, and the effects on attestation reporting. Additionally, new versions of firmware can bring new feature support, and this talk will discuss how this is identified and how it can be taken advantage of.

Room 2
16:15
30min
Windows on Arm on QEMU/KVM: Challenges and Solutions
Akihiko Odaki

Microsoft released an RTM build of Windows on Arm last year on their website, and Linaro provides instructions for running it on QEMU/KVM. Now we can run Windows on Arm on QEMU/KVM flawlessly, or can we?

Despite a basic configuration working with TCG, experiments on Asahi Linux revealed that the reliability and functionality of a Windows VM on Arm were far from on par with Windows on x64 or Linux on Arm. Key issues included:
- QEMU and KVM struggled with PMU (Performance Monitoring Unit) emulation, a critical requirement for Windows.
- The virtio-gpu graphics driver, essential for features like high and variable display resolution, frequently crashed.
- The SPICE guest agent, necessary for features such as clipboard sharing, failed to function.

These hurdles necessitated multiple patches to update the entire virtualization stack. This presentation will demonstrate how these changes not only enhance the Windows on Arm experience but also improve Windows guest and Arm virtualization experiences overall. Lastly, I'll share insights gained from bringing up such an exotic platform and discuss future work.

Room 1
16:45
16:45
15min
Closing session
Room 1
16:45
15min
Closing session
Room 2