Scale your Batch / Big Data / AI Workloads Beyond the Kubernetes Scheduler
Whether you want to run distributed training of AI models or big data processing on Kubernetes, chances are you’ll face challenges when scaling your workloads: resource fragmentation, a lack of all-or-nothing (gang) semantics for quota management and auto-scaling, low scheduling throughput, and limited priority and preemption management. The Kubernetes scheduler was historically designed to orchestrate containers of (micro-)services, rather than workloads of tightly coupled, heterogeneous, and resource-intensive batch processes.
There has recently been a Cambrian explosion of projects in the Kubernetes ecosystem innovating to solve these challenges, such as Karmada, Koordinator, Kueue, MCAD, Volcano, and YuniKorn. In this session, we’ll compare these projects, review their design choices, and discuss their pros and cons, so you’ll have a better understanding of the landscape and be able to decide which one best suits your needs for achieving better utilization of your Kubernetes clusters for batch workloads.
Cloud, Hybrid Cloud, and Hyperscale Infrastructure
Metcalf Small Ballroom (capacity 100)