KVM Forum 2024

Rohit Kumar

System Engineer @ Nutanix


Session

09-22
15:15
30min
UserfaultFD-based Memory Overcommitment
Tejus GK, Manish Mishra, Rohit Kumar

Linux virtualization environments support memory overcommitment for VMs using techniques such as host-based swapping and ballooning. Ballooning is not a complete solution, and we have observed significant performance bottlenecks with the native Linux swap system. Swapping also degrades live migration performance, since QEMU reads a VM’s entire address space, including swapped-out pages that must be faulted in to migrate their data. QEMU accesses to pages during live migration also pollute the active working set of the VM process, causing unnecessary thrashing. As a result, both guest performance and live migration times can be severely impacted by native Linux memory overcommitment.
These problems motivated us to develop a custom memory manager for VM memory that runs outside QEMU. We propose leveraging UserfaultFD to take full control of the VM memory space via an external memory manager process, exposed to QEMU as a new memory backend. QEMU requests memory from this external service and registers the UserfaultFD of the shared memory address spaces with the memory manager process. This approach lets us implement a lightweight swap system that exploits a multi-level hierarchy of swap devices with different latencies to improve performance. More generally, control over guest memory enables a wide range of additional optimizations, which we leave as future work.
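As a rough illustration of the multi-level swap idea, the toy model below keeps swapped-out pages in a list of tiers ordered from fastest to slowest and demotes the oldest page of a full tier to the next slower one. It is only a sketch under stated assumptions: the `Tier` and `TieredSwap` classes, the tier names, and the oldest-first demotion policy are all hypothetical and are not taken from the prototype described in this talk.

```python
from collections import OrderedDict


class Tier:
    """One swap device; the name and capacity are illustrative assumptions."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity      # maximum number of pages held
        self.pages = OrderedDict()    # page number -> bytes, oldest first


class TieredSwap:
    """Toy multi-level swap store; tiers are ordered fastest first."""

    def __init__(self, tiers):
        self.tiers = tiers

    def swap_out(self, page_no, data):
        # Refuse up front so a demotion chain can never run out of room.
        used = sum(len(t.pages) for t in self.tiers)
        total = sum(t.capacity for t in self.tiers)
        if used >= total:
            raise MemoryError("all swap tiers are full")
        self._place(page_no, data, 0)

    def _place(self, page_no, data, level):
        tier = self.tiers[level]
        if len(tier.pages) >= tier.capacity:
            # Hypothetical policy: demote the oldest page to the slower tier.
            old_no, old_data = tier.pages.popitem(last=False)
            self._place(old_no, old_data, level + 1)
        tier.pages[page_no] = data

    def swap_in(self, page_no):
        # Search fastest tier first and remove the page once it is found.
        for tier in self.tiers:
            if page_no in tier.pages:
                return tier.pages.pop(page_no)
        raise KeyError(page_no)

    def location(self, page_no):
        # Name of the tier holding the page, or None if it is not swapped out.
        for tier in self.tiers:
            if page_no in tier.pages:
                return tier.name
        return None
```

With a one-page fast tier and a two-page slow tier, swapping out page 1 and then page 2 leaves page 2 in the fast tier and pushes page 1 down to the slow one. A real manager would presumably choose victims by access recency or frequency rather than insertion order.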
This approach also offers significant opportunities to improve live migration. With full visibility into the swap state of guest physical memory, we can skip swapped-out pages during live migration and avoid the costly accesses they would require. By using shared remote storage accessible to both the source and destination hosts, we transfer only the swap locations of swapped-out pages instead of their contents. This eliminates the page faults that migrating swapped-out pages would otherwise trigger, and also reduces pollution of the guest's active working set.
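The migration idea above can be sketched as a stream in which resident pages carry their contents while swapped-out pages carry only a reference to their location on shared storage. This is a minimal model, not QEMU's migration format: the record layout, the function names, and the `(device, offset)` form of a swap location are assumptions made for illustration, and each page is assumed to be either resident or swapped out, never both.

```python
def build_migration_stream(resident, swapped):
    """Build the source-side stream.

    resident: dict mapping page number -> page contents (bytes)
    swapped:  dict mapping page number -> (device, offset) on shared storage
    """
    stream = []
    for page_no in sorted(resident.keys() | swapped.keys()):
        if page_no in swapped:
            # Swapped-out page: send only its location, never fault it in.
            stream.append(("swap_ref", page_no, swapped[page_no]))
        else:
            stream.append(("data", page_no, resident[page_no]))
    return stream


def apply_migration_stream(stream):
    """Rebuild the resident and swapped maps on the destination side."""
    resident, swapped = {}, {}
    for kind, page_no, payload in stream:
        if kind == "data":
            resident[page_no] = payload
        else:
            swapped[page_no] = payload
    return resident, swapped


def payload_bytes(stream):
    """Count the bytes of page contents actually carried by the stream."""
    return sum(len(payload) for kind, _, payload in stream if kind == "data")
```

In this model a guest with two resident 4 KiB pages and one swapped-out page sends 8 KiB of page contents instead of 12 KiB, and the destination can later read the third page directly from the shared device when the guest touches it.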
We will present the design and implementation of our prototype UserfaultFD-based memory overcommitment system and explain how it interoperates with QEMU for effective VM memory management. We will also demonstrate its improved performance on several VM workloads and discuss the tradeoffs involved and areas for future improvement.

Hall A+B