Scale your Batch / Big Data / AI Workloads Beyond the Kubernetes Scheduler
Whether you want to run distributed training of AI models or big data processing on Kubernetes, chances are you’ll face challenges when scaling your workloads: resource fragmentation, a lack of all-or-nothing (gang) semantics for quota management and auto-scaling, low scheduling throughput, and limited priority and preemption management. The Kubernetes scheduler was historically designed to orchestrate containers of (micro-)services, rather than workloads of tightly coupled, heterogeneous, and resource-intensive batch processes.
There has recently been a Cambrian explosion of projects in the Kubernetes ecosystem innovating to solve these challenges, such as Karmada, Koordinator, Kueue, MCAD, Volcano, and YuniKorn. In this session, we’ll compare these projects, review their design choices, and discuss their pros and cons, so you’ll have a better understanding of the landscape and be able to decide which one best suits your needs for achieving better utilization of your Kubernetes clusters for batch workloads.
Cloud, Hybrid Cloud, and Hyperscale Infrastructure
Metcalf Small Ballroom (capacity 100)