KVM Forum 2024
Summary of the work done on KVM over the past year, highlighting the significant contributions that have landed upstream.
QEMU has long had a number of downstream forks that seek to take
advantage of its flexible TCG emulation layer and combine it with
various approaches to instrumentation. The TCG plugin sub-system
introduced 5 years ago was an attempt to provide for the needs of
instrumentation in an upstream-compatible way. Recent enhancements
include the ability to read register values and a more efficient way
to implement thread-safe counters.
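For readers unfamiliar with the new counter API, a minimal sketch of a per-vCPU translation-block counter built on the scoreboard interface might look as follows (based on qemu-plugin.h as of QEMU 9.x; exact names and signatures may differ between versions):

    /* Sketch: a thread-safe per-vCPU counter using the TCG plugin
     * scoreboard API (QEMU 9.x qemu-plugin.h; names may vary by version). */
    #include <inttypes.h>
    #include <stdio.h>
    #include <qemu-plugin.h>

    QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

    static struct qemu_plugin_scoreboard *counts;   /* one slot per vCPU */
    static qemu_plugin_u64 tb_count;

    static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
    {
        /* Increment this vCPU's slot inline at TB execution, without locks. */
        qemu_plugin_register_vcpu_tb_exec_inline_per_vcpu(
            tb, QEMU_PLUGIN_INLINE_ADD_U64, tb_count, 1);
    }

    static void plugin_exit(qemu_plugin_id_t id, void *p)
    {
        /* Sum across all vCPU slots at exit. */
        fprintf(stderr, "translation blocks executed: %" PRIu64 "\n",
                qemu_plugin_u64_sum(tb_count));
        qemu_plugin_scoreboard_free(counts);
    }

    QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                               const qemu_info_t *info,
                                               int argc, char **argv)
    {
        counts = qemu_plugin_scoreboard_new(sizeof(uint64_t));
        tb_count = qemu_plugin_scoreboard_u64(counts);
        qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
        qemu_plugin_register_atexit_cb(id, plugin_exit, NULL);
        return 0;
    }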
Alex asks: have we done enough to enable the more interesting use cases
such as binary analysis and fuzzing?
Is it time to revisit the limitations introduced to avoid GPL end-runs
and allow plugins to affect system state?
Are there any more features that tools like AFL++ or ThreadSanitizer need
in order to introspect and analyze a system running in QEMU?
The Confidential Story
Rivers, dams and kernel development
For a new hardware feature to be available to users, Linux and often other levels of the virtualization stack have to support it. The time needed for development and upstream acceptance can be substantial and difficult to predict.
This talk will analyze the past, present and future of enabling confidential computing on both the kernel and the QEMU sides. It will show how hardware vendors can benefit from working as closely as possible with upstream communities during “in-house” development, and how this can reduce the friction caused by different approaches coming in concurrently from multiple hardware vendors. I will also present the work done by Red Hat and Intel as part of the CentOS Stream Virtualization SIG, and how a stable base kernel facilitates work on confidential computing at the higher levels of the stack.
While KVM and other virtualization tools dominate the scene, QEMU's TCG emulation deserves a second look. Sure, it excels at cross-development and retro gaming, but what if it could do more? This talk explores using TCG not just to mimic processors, but to create a new one.
Designing a processor microarchitecture involves choosing among many options, and implementing them all in silicon takes forever. A common methodology for microarchitecture design exploration is therefore to use simulators. Simulators focus on replicating execution timing and leave out functional details for faster implementation; however, they lack a mechanism to actually perform operations and resolve branches and memory accesses. Interpreters, the typical solution to this, are slow, unreliable, and feature-poor.
Here's where TCG shines. Compared to interpreters, TCG boasts impressive speed, supports multiple architectures, allows debugging with GDB, and handles both system and user space emulation. To showcase its potential, we describe a practical case of employing TCG for RISC-V emulation and integrating it with a simulator through a custom TCG plugin. In the process, we contributed to the upstream development of a new feature of the TCG plugin infrastructure to read registers, crucial for our use case. Finally, we discuss possibilities to extend QEMU to empower future microarchitecture research further.
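As a purely illustrative sketch (not the authors' actual plugin), feeding register values to an external simulator via the new register-read API could look roughly like this; the register name "pc" and the simulator hook are assumptions, and the API follows qemu-plugin.h as of QEMU 9.x:

    /* Sketch: read a register at each executed instruction via the TCG
     * plugin register-read API (QEMU 9.x; names may vary by version). */
    #include <string.h>
    #include <glib.h>
    #include <qemu-plugin.h>

    QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

    /* For brevity a single handle is kept; a real plugin might track one
     * per vCPU. Register handles must be looked up from a vCPU context. */
    static struct qemu_plugin_register *pc_reg;

    static void vcpu_init(qemu_plugin_id_t id, unsigned int vcpu_index)
    {
        GArray *regs = qemu_plugin_get_registers();
        for (guint i = 0; i < regs->len; i++) {
            qemu_plugin_reg_descriptor *d =
                &g_array_index(regs, qemu_plugin_reg_descriptor, i);
            if (!strcmp(d->name, "pc")) {   /* assumes the target calls it "pc" */
                pc_reg = d->handle;
            }
        }
        g_array_free(regs, true);
    }

    static void insn_exec(unsigned int vcpu_index, void *udata)
    {
        GByteArray *buf = g_byte_array_new();
        int len = qemu_plugin_read_register(pc_reg, buf); /* target-endian bytes */
        /* ... hand the value off to the timing simulator here (hypothetical) ... */
        (void)len;
        g_byte_array_free(buf, true);
    }

    static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
    {
        for (size_t i = 0; i < qemu_plugin_tb_n_insns(tb); i++) {
            struct qemu_plugin_insn *insn = qemu_plugin_tb_get_insn(tb, i);
            qemu_plugin_register_vcpu_insn_exec_cb(insn, insn_exec,
                                                   QEMU_PLUGIN_CB_R_REGS, NULL);
        }
    }

    QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                               const qemu_info_t *info,
                                               int argc, char **argv)
    {
        qemu_plugin_register_vcpu_init_cb(id, vcpu_init);
        qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
        return 0;
    }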
The Confidential Story
Early development across the stack: living in stilt houses
In the second session of the Confidential Story, we will cover the adoption of the Confidential Compute stack which, at the time of this abstract submission, is not yet fully upstreamed in the Linux kernel, QEMU, or the Kata and Confidential Containers projects.
The Kata and Confidential Containers projects are two of the main consumers of the Confidential Compute stack, and aim to leverage it in the Cloud Native ecosystem, relying heavily on the work presented during the first part of this session. The Kata and Confidential Containers teams at Intel worked closely with distributions to make the development process easier and more efficient for both developers and adopters of those projects.
We will cover the challenges of building a reasonable upper stack for early adopters, and doing so on top of moving pieces. We will show that working with moving parts is actually normal, and that collaboration is the way to make sure that a solution will be ready without many delays from the moment that the foundation is solid.
As cloud technologies continue to advance at a rapid pace, there arises a critical need to assess the performance disparities among various virtualization stacks. This presentation aims to shed light on the comparative performance, scalability, and efficiency of two prominent hypervisor technologies—KVM/QEMU and Linux as Root Partition for Microsoft Hyper-V with Cloud-Hypervisor as VMM—within the realm of nested virtualization. Through a comprehensive evaluation, we will scrutinize diverse performance metrics encompassing CPU utilization, memory consumption, I/O throughput, and latency across varying workloads and configurations. We will also examine the guest attestation process and security aspects within these distinct hypervisor stacks. By delving into these key aspects, we seek to offer valuable insights into the operational characteristics and suitability of each hypervisor technology for nested confidential guest environments.
The advent of the Software Defined Vehicle (SDV) comes with its unique set of opportunities and challenges. Consolidating control units (CUs) into a single central computer implies a potential need to rely on virtualization to run multiple operating systems on a single system.
For QEMU, this means new challenges. In addition to growing to support new paravirtual interfaces required by the automotive industry, it also needs to be able to provide optimal data paths for data-intensive virtual devices, such as GPUs, NPUs, DSPs, video and media devices. At the same time, it needs to offer excellent uptime and accept being encapsulated in a context that guarantees proper spatial and temporal separation between it and other components in the system.
In this talk, I'll detail these challenges and the work we're doing to overcome them while building the Virtualization feature for Automotive Stream Distribution (AutoSD), the distribution maintained by the CentOS Automotive SIG.
Automated testing is a very important tool in modern software engineering, and is often implemented through runners in virtualized environments. Low-level domains, such as kernel development, have unique requirements that are not covered in these environments, but automation of all different kinds of hardware comes with additional challenges. In this talk, we will deep-dive into some of the challenges we encountered during development and use of SoTest, a custom-made service for automatically testing and benchmarking virtualization workloads on a broad variety of hardware, including:
- How can we make arbitrary hardware remote-controllable?
- How do we integrate these tests into the developer workflow?
- How can we optimize the throughput?
- How do we enable performance benchmarking?
- How do we handle, e.g., network infrastructure flakiness?
The VM Privilege Level (VMPL) feature of SEV-SNP allows for privilege separation within an SEV-SNP guest. Each VMPL will require its own execution state for each vCPU. A Secure VM Service Module (SVSM) runs at the highest privilege level to provide services to lower privilege levels (such as a Linux guest OS). This talk looks to investigate how to maintain VMPL state for each guest vCPU and how to efficiently switch between VMPL levels of the guest vCPU.
The Android Virtualisation Framework supports the creation of confidential (aka "protected") guests which provide a native code environment for confidential payloads that require isolation from the rest of the Android Operating System. However, the guest kernel requires a little enlightenment to function usefully in a protected environment.
This talk will describe the protected VM environment provided by pKVM, the guest changes necessary for it to work properly, how it differs from some of the other CoCo efforts and finally demonstrate the guest-side changes running on top of the latest upstream kernel as a protected guest on a real Android phone.
The Linux PCI endpoint framework enables Linux to operate as a PCIe Endpoint device by interacting with hardware. It provides functions for describing PCIe configuration space content and transferring data via the PCIe bus. However, testing the framework and its implementations can be challenging due to the limited availability of real PCIe endpoint hardware for testing purposes. To address this limitation, we are proposing a virtual device that allows the PCIe endpoint framework to function without relying on physical hardware. This virtual device can improve the testability, leading to more robust and reliable implementations. In this session, we will introduce the design and implementation of this virtual device.
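For context, the host side of such a setup is typically exercised through the in-kernel pci_endpoint_test driver; a hedged sketch of driving it from user space is shown below (the device node name is the driver's default and the ioctl macros come from linux/pcitest.h; return-value interpretation is deliberately left open, as it varies between kernel versions):

    /* Sketch: poke a (virtual or real) PCI endpoint function from the host
     * side through the pci_endpoint_test driver. Adjust the device node
     * name for your setup. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/pcitest.h>

    int main(void)
    {
        int fd = open("/dev/pci-endpoint-test.0", O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Check that BAR0 of the endpoint function is mapped and usable. */
        long ret = ioctl(fd, PCITEST_BAR, 0);
        printf("BAR0 test: %ld\n", ret);

        /* Ask the endpoint to raise MSI vector 1 towards the host. */
        ret = ioctl(fd, PCITEST_MSI, 1);
        printf("MSI test:  %ld\n", ret);

        close(fd);
        return 0;
    }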
All virtual machines, in the most common use case, use system firmware present in the ‘standard’ path on the hypervisor host while booting. For BIOS-booted VMs, the firmware is normally SeaBIOS-based, and for UEFI-booted VMs, it is edk2-based. Currently, when a cloud VM is launched, the firmware binary is supplied by the cloud provider and the end user has no control over it. For confidential VMs, this represents problems both for the end user and for the cloud provider.
- The end user gets firmware measurements for attestation purposes; however, without the ability to provide a self-built (or otherwise trusted) binary, these measurements can only indicate that the firmware hasn’t changed. The end user has to implicitly place some trust in the cloud-provider-supplied firmware binary.
- The cloud provider can’t update the firmware (e.g. to fix a vulnerability) without disturbing user workloads. As the firmware is included in the launch measurements, just swapping the firmware will cause attestation errors. The problem is even worse for embargoed vulnerabilities.
This talk describes a method of supplying system (UEFI) firmware for VMs as part of the VM disk image. The cloud provider would not need to look into or get access to the VM disk image. The VM will use the proposed mechanism to provide the firmware binary to the hypervisor. The hypervisor will use this mechanism to install the firmware binary into the guest ROM and regenerate the VM. Our initial approach will be based solely on QEMU/KVM/EDK2/UKI. The approach should eventually become widely adopted across the industry (other cloud providers, hypervisors/VMMs, etc.).
Our approach has several advantages compared to using an IGVM container image with an embedded firmware passed to the hypervisor when starting the guest.
- First of all, the firmware image is provided along with the guest VM image (using a UKI add-on). Therefore, the guest image and the firmware binary can be packaged together as a single unit. There is no need to store the firmware blob (inside an IGVM container) separately on the hypervisor host and pass it to the hypervisor when starting the guest.
- Secondly, the request to the hypervisor to install the firmware image is initiated directly by the guest. Therefore, the guest controls when to upgrade the firmware and which firmware image to upgrade to; there is no need for the hypervisor to make any decision on this. The hypervisor does not need access to the VM image either.
- Lastly, it is possible to upgrade the firmware without deploying a new guest VM image (and a new IGVM image containing the new firmware). An existing VM spawned from the current VM image can upgrade to a new firmware image simply by updating the UKI firmware add-on to a new PE binary and using the mechanism to install it in the guest ROM.
We intend to give a demo of our prototype in action.
We all know that debugging containers can be tough. In immutable pods, with limited capabilities and minimal packages, we can be additionally hindered without our favorite debugging tools or even the required privileges; it can be downright frustrating. And containerized virtual machines aka KubeVirt? It's another layer again!
But there's hope. Join us as we explore common problems and repeatable solutions to debug KubeVirt failures and how to mix Kubernetes debugging techniques with more traditional virtualization debugging methods.
In this session you will learn:
- How to do privileged debugging on the node
- How we can bring additional tools into the target container as a regular user
- How to use traditional VM debugging in Kubernetes
All of which will help you to find your way in the container world.
Live Updates are an increasingly successful way to update the host kernel and userland on a KVM host with very little downtime, and without the need for Live Migration. It works by leaving VM state in memory and switching to a new kernel via kexec without a full power cycle. For our cloud computing business @ Akamai, this is a game changer. Capacity constraints are a driving motivator for us to use live updates, particularly with the way we operate at the edge. We operate a product named Akamai Cloud Computing (formerly www.linode.com). Oftentimes, live migration is simply not an option due to those capacity constraints. In working with the recently merged CPR (Checkpoint Restart) feature, we are putting this new QEMU functionality to good use. We are actively productionizing the use of Live Updates, and in this talk we describe some of the challenges we went through to make Live Updates work. We're hoping that tales from a cloud provider will motivate more companies to get engaged with this powerful support from QEMU so that Live Updates can become a routine operation in the cloud.
Confidential computing - making VM guest secrets harder for
the hypervisor to access - is getting more and more important as
time goes by.
Virtio (and paravirtualization generally), fundamentally, can be thought of as
a means of improving guests by making use of hypervisor functionality. To what
level can this still be beneficial when the guest does not want to fully trust
the hypervisor?
This talk will try to address these questions by touching on the following
areas:
- a review of new features/devices and how they interact with confidential computing
- the status and plans for hardening (improving confidentiality) with virtio on Linux
- known open issues and how you can help
The COCONUT Secure VM Service Module (COCONUT-SVSM) is evolving from a service module for confidential VMs to a paravisor layer for running unenlightened operating systems. This talk will highlight the COCONUT-SVSM community's achievements in the past year and introduce the project's direction towards paravisor support.
While significant progress has been made, challenges remain within the COCONUT codebase and upstream adoption within the KVM hypervisor. The presentation will delve into proposed solutions to enable support for AMD SEV-SNP VMPLs and Intel TDX partitioning within KVM and QEMU. A particular focus will be placed on the intricacies and challenges associated with the IRQ delivery architecture.
Current QEMU live migration device state transfer is done via the (single) main migration channel.
Transferring device state migration data this way reduces performance and severely impacts the migration downtime for VMs with a large device state that needs to be transferred during the switchover phase.
Some examples of devices that have such large switchover phase device state are some types of VFIO SmartNICs and GPUs.
This talk describes the efforts to parallelize the transfer and loading of device state for these VFIO devices by utilizing QEMU's existing support for multiple migration connections - that is, by utilizing multifd channels for their transfer, together with other parallelization improvements.
The Coconut-SVSM is a platform to provide secure services to Confidential Virtual Machine guests. On AMD SEV-SNP, it runs inside the guest context at an elevated privilege level (VMPL).
The SVSM is not yet able to preserve state across reboots, so it provides services with limited functionality, such as a non-stateful virtual TPM for measured boot.
In this talk, we will describe the ongoing work towards stateful services, including a fully functional vTPM and a persistent and secure UEFI variable store, which can be employed for Secure Boot. This is achieved by adding encrypted persistent storage to the Coconut-SVSM, which is backed by the host hypervisor. The decryption key is received from the attestation server after a successful remote attestation during the early boot phase of the SVSM. The attestation covers the integrity of the platform, including SVSM and OVMF firmware. A host-side proxy is used to communicate with the server to keep the code in the SVSM context small.
During the talk we will look at the current challenges we are facing, potential attacks to defend against, and future developments to support a persistent state in SVSM.
Linux virtualization environments support memory overcommitment for VMs using techniques such as host-based swapping and ballooning. Ballooning is not a complete solution, and we have observed significant performance bottlenecks with the native Linux swap system. Swapping also degrades live migration performance, since QEMU reads a VM’s entire address space, including swapped-out pages that must be faulted in to migrate their data. QEMU accesses to pages during live migration also pollute the active working set of the VM process, causing unnecessary thrashing. As a result, both guest performance and live migration times can be severely impacted by native Linux memory overcommitment.
These problems motivated us to develop a custom memory manager (external to QEMU) for VM memory. We propose leveraging UserfaultFD to take full control of the VM memory space via an external memory manager process, exposed to QEMU as a new memory backend. QEMU requests memory from this external service and registers the userfaultFD of shared memory address spaces with the memory manager process. This approach allows us to implement a lightweight swap system that can take advantage of a multi-level hierarchy of swap devices with different latencies that can be leveraged to improve performance. More generally, gaining control over guest memory enables a wide range of additional optimizations as future work.
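A minimal sketch of the core mechanism is shown below; the userfaultfd syscall and ioctl names are the standard kernel interface, while the surrounding memory-manager logic (region size, swap source, 4 KiB pages) is simplified and hypothetical:

    /* Sketch: an external memory manager taking over fault handling for a
     * guest RAM region shared with QEMU (simplified; error handling omitted). */
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <poll.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 256 * 1024 * 1024;           /* guest RAM slice (example) */
        /* In the real design this would be the shared memory backing the VM,
         * e.g. a memfd passed from QEMU; here we just map anonymous memory. */
        void *ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        /* Register the region for missing-page events. */
        struct uffdio_register reg = {
            .range = { .start = (unsigned long)ram, .len = len },
            .mode = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /* Fault-handling loop: resolve each missing page, e.g. by fetching it
         * from a fast or slow swap tier and installing it with UFFDIO_COPY. */
        for (;;) {
            struct pollfd pfd = { .fd = uffd, .events = POLLIN };
            poll(&pfd, 1, -1);

            struct uffd_msg msg;
            if (read(uffd, &msg, sizeof(msg)) <= 0 ||
                msg.event != UFFD_EVENT_PAGEFAULT) {
                continue;
            }

            static char page[4096];               /* would come from swap */
            struct uffdio_copy copy = {
                .dst = msg.arg.pagefault.address & ~0xfffUL, /* 4 KiB pages */
                .src = (unsigned long)page,
                .len = sizeof(page),
            };
            ioctl(uffd, UFFDIO_COPY, &copy);
        }
    }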
This approach also offers significant opportunities to improve live migration. With full visibility into the swap state of guest physical memory, we can avoid costly accesses to swapped-out pages, skipping over them during live migration. By using shared remote storage accessible to both the source and destination hosts, we transfer only their swap locations, instead of their page contents. This eliminates the page faults associated with swapped-out pages, and also reduces pollution of the guest's active working set.
We will present the design and implementation of our prototype userfaultFD-based memory overcommitment system, and explain how it interoperates with QEMU for effective VM memory management. We will also demonstrate its improved performance on several VM workloads, and discuss various tradeoffs and areas for future improvement.
Birds of a Feather sessions are a place for informal meetings where attendees group together based on a shared interest.
A topic lead (submitter) will propose a BoF for their area of interest during the first day of the conference and will drive the conversations.
Summary of the work done on QEMU over the past year, highlighting the significant contributions that have landed upstream.
QEMU: Let's talk about QMP, QAPI, and our user-facing API documentation generated by Sphinx.
- Have you ever wondered what the difference between QMP and QAPI is, and have a deep-seated fear that not knowing the precise, technical answer will come to haunt you in five years when your new feature ships in an enterprise distribution?
- Have you ever lain awake in bed at night wondering what exactly that new enum value you added actually changed in the QMP protocol, if anything?
- Have you ever logged in to develop a new QEMU feature on Monday morning while slightly hung over and cursed out the QMP reference manual and/or your god(s) in a fit of rage while exclaiming "Someone ought to fix this!"?
It's me! I'm "Someone"! Come and see what we are cooking up; this talk is for you.
This talk covers recent developments in the QAPI generator and what they mean for developers implementing new APIs and features, as well as the massive new QMP user documentation overhaul project that will -- this time, we promise -- produce user-friendly, reliable, accurate, and aesthetically pleasing QMP documentation. That documentation will serve as our new gold standard and help direct users of QMP and libvirt alike.
Pleas for help from QEMU maintainers with relevant subject expertise to review and refresh the QMP documentation will also feature prominently.
VFIO has transformed virtual machine (VM) performance by enabling
direct device assignment. This presentation delves beyond a status
report, showcasing the exciting advancements realized over the past few
years.
Central to these improvements is a comprehensive code refactoring
effort. The vfio-pci driver has been split into a core library, paving
the way for a new generation of variant drivers. These drivers unlock
device-specific functionalities, pushing the boundaries of VM
capabilities.
Examples include:
- Enhanced Migration: Leveraging the new vfio migration interface (version 2), variant drivers from NVIDIA/Mellanox, Huawei/HiSilicon, and Intel enable seamless VM migration with VFIO devices.
- Advanced BAR Management: A dedicated NVIDIA variant driver demonstrates the power of the core library by recomposing device PCI BARs through a coherent memory region exposed by the SoC interconnect.
- Direct Device Access: A further refactoring of the VFIO container and group code introduces a new character device interface (cdev). This interface allows direct device access and native support for VFIO devices utilizing the cutting-edge userspace IOMMU interface, IOMMUFD.
This presentation delves into these advancements, along with other
exciting developments in the VFIO ecosystem. It will showcase how these
innovations are empowering users to achieve unprecedented levels of
performance and flexibility in virtualized environments.
Intel client platforms since Alder Lake have begun to leverage hybrid CPU architectures, which can achieve a good balance of performance and power on bare metal. However, VMs are still unable to take advantage of the hybrid CPU architecture, not only because QEMU/KVM is unable to expose the P-core/E-core difference to VMs, but also because the P-core/E-core feature differences further block hybrid support in the guest. As a result, VM performance on hybrid platforms lags far behind bare metal, and some CPU features are not fully supported in the guest (e.g. the PMU, which is disabled by KVM on hybrid platforms).
To address this, our presentation will mainly have the following aspects:
- Illustrate our proposal for designing the QEMU API to allow users to create a hybrid CPU/cache topology for the guest, with flexible CPU type configurations. This lets the guest recognize the P-core/E-core type of each vCPU as well as different cache topologies. We achieve this by abstracting the topology device in a QOM way, refactoring QEMU's current general topology implementation. We will also specifically describe our QOM-topology implementation and how it would help QEMU improve the general topology subsystem.
- Present our exploratory experience with VM performance optimization methods that can be applied to VMs with hybrid vCPUs. On Intel client platforms there are technologies such as ITD/ITMT, which can optimize workloads in the VM based on the hybrid CPU/cache topology.
In summary, this presentation will cover all aspects of reaching optimal CPU virtualization on hybrid platforms, for both performance and features.
Multi-tenant cloud environments demand secure and cost-effective workload isolation. Single Root I/O Virtualization (SR-IOV) tackles this challenge by extending PCI multifunction's capabilities. It introduces lightweight and isolated "virtual functions (VFs)" managed by a central "physical function (PF)". A PF exposes interfaces to configure the device for specific scenarios and optimize resource allocation.
For example, SR-IOV-enabled network interfaces can create VFs representing virtual network interfaces. This allows a host to assign VFs to guest VMs and configure the offloading of packet switching with the PF, minimizing network virtualization overhead.
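To make the PF/VF relationship concrete, here is a hedged sketch of how VFs are instantiated on an SR-IOV capable PF through the standard Linux sysfs interface; the PCI address is a placeholder and the snippet is not tied to the specific proposal described in this talk:

    /* Sketch: enable 4 virtual functions on an SR-IOV capable PF via sysfs.
     * The PCI address 0000:65:00.0 is a placeholder; requires root and a PF
     * driver that implements sriov_configure. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *path =
            "/sys/bus/pci/devices/0000:65:00.0/sriov_numvfs";

        FILE *f = fopen(path, "w");
        if (!f) {
            perror("fopen");
            return EXIT_FAILURE;
        }

        /* Writing N creates N VFs; the count must be reset to 0 before it
         * can be changed to a different non-zero value. */
        if (fprintf(f, "4\n") < 0) {
            perror("fprintf");
        }
        fclose(f);

        /* The new VFs show up as regular PCI devices (virtfn0..virtfn3 links
         * under the PF) and can then be bound to vfio-pci or assigned to VMs. */
        return EXIT_SUCCESS;
    }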
However, current SR-IOV utilization is limited because the controllability of SR-IOV is not exposed to guests. We propose emulating SR-IOV on QEMU and integrating it with vDPA to grant guests control over SR-IOV while offloading the data path.
To showcase the effectiveness of this approach, we'll present a detailed performance benchmark using a PoC that offloads container networking in the guest. We'll also introduce a design for SR-IOV emulation that provides packet-switching configurability, further motivating its adoption.
Next, we describe the current development status of SR-IOV emulation on QEMU. QEMU already includes some SR-IOV device implementations, but they are based on physical designs, limiting flexibility, and lack datapath offloading. We're addressing this by developing an SR-IOV feature for virtio-net devices, which is fully configurable and enables integration with vDPA. While we leverage QEMU's existing PCI multifunction mechanism to support configuration flexibility, SR-IOV emulation presents unique implementation challenges that we'll discuss as well. The new SR-IOV feature in virtio-net will be valuable for immediate testing and serve as a foundation for the future development of practical SR-IOV designs.
As discussed in KVM Forum 2022, there are many good reasons why you might want to run your storage backends outside of the QEMU process that runs your VM, and the obvious answer to this is qemu-storage-daemon. But while naming a tool is an answer, it's not a full answer: QSD provides a variety of different export types – and more may be coming – that allow connecting it to the VM, and each has different performance characteristics and limitations.
In this talk, Kevin will compare the options we have, illustrated by the case of providing QSD-backed storage to Kubernetes and KubeVirt, and explore ideas for future directions and optimisations, such as adding QSD support for ublk and extending ublk on the kernel side to introduce a fast path for the common case.
Most of the considerations will apply to potential other storage daemons as well.
The vfio-platform driver and its QEMU integration were introduced in 2015.
Since then, not much has been contributed upstream in terms of device
integration. For instance, the last kernel reset module was contributed
in 2017, emphasizing the lack of in-kernel device growth. It is known that
vfio-platform is used, sometimes for evil motivations such as obfuscation,
but the infrastructure is not really used for the original intent it
was contributed for. To help things evolve, this talk presents the steps
to be carried out at the kernel and QEMU level
to enable safe passthrough of a DMA-capable platform device.
Kernel reset modules and device tree node generation in QEMU will be
covered. Examples will be presented based on already integrated
devices and other candidate devices. This should help attendees
to identify or design devices that can be easily integrated
and understand showstoppers with regard to resource dependencies.
Guests with multiple vCPUs are commonplace and can submit I/O requests from any vCPU. While virtio-blk supports exposing multiple queues to the guest, QEMU processed all queues in a single thread until recently.
This talk introduces the virtio-blk IOThread Virtqueue Mapping feature added in QEMU 9.0. This feature improves scalability by processing queues in a user-configurable number of threads. Removing the single threaded bottleneck narrows the performance gap between bare metal and virtualization.
Benchmark results are presented to quantify the impact on performance. Configuration topics, like choosing the number of threads, are discussed. Finally, open issues and future support in virtio-scsi and other device types are also covered.
Compute Express Link (CXL) is an open standard interconnect built upon the industry-standard PCI layers to enhance the performance and efficiency of data centers by enabling high-speed, low-latency communication between CPUs and various types of devices such as accelerators and memory. It supports three key protocols: CXL.io as the control protocol, CXL.cache as the host-device cache-coherent protocol, and CXL.mem as the load/store memory access protocol. CXL Type 2 devices leverage all three protocols to seamlessly integrate with host CPUs, providing a unified and efficient interface for high-speed data transfer and memory sharing. This integration is crucial for heterogeneous computing environments where accelerators, such as GPUs and other specialized processors, handle intensive workloads.
VFIO is the standard interface used by the Linux kernel to pass a host device, such as a PCI device, to a virtual machine (VM). To pass a PCI device to a VM, VFIO provides several modules, including vfio-pci (the generic PCI stub driver), VFIO variant drivers (vendor-specific PCI stub drivers), and vfio-pci-core (the core functions needed by vfio-pci and other VFIO variant drivers). With the VFIO UABIs, a user-space device model like QEMU can map the device registers and memory regions into the VM, allowing the VM to access the device directly. With a VFIO variant driver from a HW vendor, it can also support mediated passthrough and live migration for use cases like vGPU. Although CXL is built upon the PCI layers, passing through a CXL type-2 device can differ from passing a PCI device due to the CXL specifications, e.g. emulating CXL DVSECs, handling CXL-defined register regions in the BAR, and exposing CXL HDM regions. Thus, a new set of VFIO CXL modules needs to be introduced.
In this topic, we review the requirements of a CXL type-2 device, discuss the architecture design of VFIO CXL modules, their UABIs, and the required changes to the kernel CXL core and QEMU besides VFIO.
Cloud computing, with its flexible resource allocation and large-scale data storage, provides an integrated underlying platform for the widespread application of AI, including large-scale model training and inference. However, unlike traditional applications, AI focuses more on heterogeneous computing, and building it on virtualization brings some new issues and challenges, including:
1. The PCIe P2P communication efficiency between GPUs, or between GPUs and RDMA NICs, is crucial for large-scale model training and inference. However, in virtualization scenarios there is a serious performance degradation due to the enablement of the IOMMU.
2. Various higher-precision (millisecond-level) monitoring agents are usually deployed in VMs to monitor metrics such as PCIe bandwidth, network bandwidth, etc. We found that traditional PMU virtualization cannot fully meet such monitoring needs, and those monitoring agents can also cause a high number of VMEXITs due to frequent PIO and RDPMC operations.
To address these challenges, this topic proposes a set of solutions, such as avoiding the redirection of P2P TLPs to the IOMMU and passing through core and uncore PMUs to the guest, to bridge the gap in AI infrastructure between virtualized and bare-metal environments.
We give a multifaceted insight into what’s going on with virtio-fs: from the current state and future prospects of live migration support, where we have made considerable progress, through experimental areas, to a look at performance.
Some experimental areas are the support for non-vhost-user interfaces, such as /dev/fuse and vDPA/VDUSE, and to go beyond our simple passthrough driver, both via filesystem “transformation” functionality (e.g. UID/GID mapping) and by including native drivers such as network filesystem drivers.
As for virtio-fs’s performance, we’re going to have a look at both the interface, specifically multiqueue support, and virtiofsd’s internal architecture.
For SEV-SNP live migration support, a migration helper would run as a mirror VM. The mirror VM would use the existing KVM APIs to copy the KVM context and populate the NPT page tables at page-fault time. The mirror VM also does the dirty page tracking and finalizes the end of live migration. In designing the guest_memfd APIs for the mirror VM, we want to consider the post-copy use case as well, so that the copying of paged-in memory in the mirror VM would have a separate memory view. In this talk we will cover the above use cases for the guest_memfd and mirror VM design for SEV-SNP live migration.
This talk presents the current status and ongoing efforts to implement VirtIO GPU for infotainment systems in the automotive industry. We will highlight our decision to develop VirtIO GPU in Rust as a vhost-user device under the Rust-VMM project umbrella.
Implementing VirtIO for hardware enables the deployment of Android on various VMMs that support VirtIO, such as Crosvm and QEMU. This approach offers benefits like reducing the attack surface of QEMU and providing more granularity in setting up permissions for the device process. Our VirtIO GPU implementation manages GPU device operations using the vhost-user protocol. Currently we support virglrenderer, and we are exploring the integration of gfxstream to allow the use of either of the two rendering components for graphics rendering and processing.
During this presentation, I will share our journey in building the VirtIO GPU device in Rust, including the challenges faced and the milestones achieved. I will shed more light on the past, present and future status.
VSM is a virtualization-based security technology introduced by Microsoft that leverages the hypervisor's higher trust base to protect guest data against compromises. It introduces primitives that allow monitoring the guest's execution state from a higher privilege context, as well as enforcing memory access limitations beyond the guest's page tables.
At the KVM Forum 2023, we introduced VSM and the challenges we faced in emulating it in KVM. We have made significant progress since then, and more importantly, we settled on an innovative design based on the concept of sharing multiple KVM VMs within a single QEMU VM. We call these “Companion VMs.” In this talk, we will revisit the core VSM concepts and delve into how we managed to model VSM's privileged execution contexts as distinct KVM VMs. Additionally, we will discuss how this approach could be utilized in the context of confidential computing (SEV SNP VMPLs) or to enhance device emulation security by moving it into the guest context. Ultimately, we will provide an update on our efforts to upstream our work in both KVM and QEMU.
Among all the other virtio devices, virtio-gpu stands out due to its versatility. On the surface, it's a device that provides a paravirtualized GPU and display controller. But thanks to the powerful combination of its three main primitives (a virtqueue transport, shared memory and fences) it's today able to support multiple, specialized personalities to cover different use cases, enabling graphics acceleration at different levels (from native DRM to GL abstraction) and offloading compute tasks from the guest to the host's GPU.
In this talk I'll detail current and future virtio-gpu capabilities, their implementation and intended use cases, and how you can take advantage of them from different software stacks. If time permits, I'll also demonstrate one of its lesser known capabilities.
While almost all VM operating systems support interrupt and exception handling, some operating systems may have certain built-in assumptions about interrupt behavior based on bare-metal hardware. A malicious hypervisor can break these assumptions and put guest drivers or the guest OS kernel into an unexpected state, which could lead to a security issue.
To address this concern, SEV-SNP supports features to protect the guest against malicious injection attacks. The preferred method is Restricted Injection, but this was rejected by upstream. This talk introduces another approach, the Alternate Injection feature of SEV-SNP, which will use Secure VM Service Module (SVSM), and APIC emulation in the SVSM to secure interrupt delivery into an SEV-SNP guest.
I’ll be presenting the draft of virtio-video device specification, talking about the challenges we’re facing, and hoping to get your feedback on what’s needed to move toward standardization.
In this presentation, we will share our experience of developing the KVM backend for VirtualBox. It allows VirtualBox to use KVM as a hypervisor and makes the VirtualBox third-party kernel modules unnecessary.
VirtualBox is a vast C++ codebase that implements a full virtualization solution in a cathedral style. It consists of a tightly integrated kernel and userspace part with lots of flexibility to execute code in the kernel or in userspace depending on the situation. Both components are highly portable across operating systems. This unique architecture predates KVM and is very different from how QEMU interacts with KVM.
Because shipping a third-party hypervisor is more and more problematic on Windows and macOS, VirtualBox has introduced a new internal abstraction, the Native Execution Manager (NEM). NEM allows using the native virtualization API of the operating system. There are unfinished and experimental NEM backends in the VirtualBox code base for Hyper-V, the Apple Hypervisor Framework and KVM.
Starting from the incomplete KVM backend already present in the VirtualBox code base, we gradually turned it into a fully-featured and stable backend ready for day-to-day use. We will discuss the main challenges we faced in this journey. We will mostly focus on the following two topics:
- Integrating VirtualBox with KVM’s IRQCHIP abstraction to leverage advanced interrupt virtualization features (something that vanilla VirtualBox cannot do),
- Enabling nested virtualization for VirtualBox and the challenges we faced around the KVM API.
As we previously worked extensively on custom hypervisors, we also want to share our constructive thoughts on the KVM API, highlighting its successes, complexities and maybe even starting a discussion on how to simplify it.
The mainline KVM currently does not support the virtualization of Arm’s TrustZone. This means virtual machines (VMs) running on KVM cannot leverage TrustZone to run a trusted execution environment (TEE), such as OP-TEE. To address this limitation, we have extended KVM to expose a virtual TrustZone to VMs. To virtualize TrustZone's CPU features, we multiplex the virtual EL3 and secure EL1 on the normal world EL1 on the hardware. We adopt trap-and-emulate to handle sensitive instructions executed in the virtual TrustZone in KVM. Additionally, we build on the current TrustZone hardware abstraction in QEMU by creating a memory region representing virtual secure memory and mapping secure IO onto it. Our KVM prototype supports booting a paravirtualized OP-TEE. We plan to open-source our implementation to the community. As a next step, we will explore exposing TrustZone to confidential VMs based on pKVM and Arm CCA and extend QEMU to virtualize secure IO devices, such as TZPC.
Two recent papers about serverless confidential computing have identified key overheads when booting SEV and SNP guests with OVMF. Are these claims well-founded? This talk will show how to benchmark OVMF while avoiding common pitfalls and how to identify the overhead introduced when confidential computing is enabled. Furthermore, the talk will unravel whether the overhead is the result of hardware requirements, firmware design, or implementation error. Will alternate firmware layouts and boot schemes (e.g. IGVM and the SVSM) ameliorate these issues or make them worse?
Starting from Windows 11 version 24H2, the Windows Hypervisor Platform APIs are available in preview form on Arm devices to enable usage of third party VMMs.
This presentation will also cover the device extensibility support provided by Hyper-V for out-of-process PCIe devices leveraging the Hyper-V VMM, and how this allows using QEMU's device emulation logic while still using the Hyper-V VMM included in Windows (vmwp).
The closing session for KVM Forum 2024.