2024-08-16, Metcalf Small Ballroom (capacity 100)
Whether you want to run distributed AI model training or big data processing on Kubernetes, chances are you'll face challenges when scaling your workloads: resource fragmentation, lack of all-or-nothing semantics for quota management and auto-scaling, low throughput, and limited priority and preemption management. The Kubernetes scheduler has historically been designed to orchestrate containers for (micro-)services, rather than workloads made of tightly coupled, heterogeneous, and resource-intensive batch processes.
There has recently been a Cambrian explosion of projects in the Kubernetes ecosystem that have innovated to solve these challenges, such as Karmada, Koordinator, Kueue, MCAD, Volcano, and YuniKorn. In this session, we'll compare these projects, review their design choices, and discuss their pros and cons, so you'll have a better understanding of the landscape and be able to decide which one best suits your needs when it comes to achieving better utilization of your Kubernetes clusters for batch workloads.
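To make the quota-management problem concrete, here is a sketch of how one of the projects above, Kueue, addresses all-or-nothing admission: a workload is only admitted once its entire resource request fits within a queue's quota. The queue names and quota values below are illustrative, not from the session:

```yaml
# A ResourceFlavor describes the kind of nodes quota is drawn from.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
# A ClusterQueue holds the quota; Kueue admits a workload only if
# all of its pods fit within the remaining quota (all-or-nothing).
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: batch-queue
spec:
  namespaceSelector: {}  # accept workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 32
      - name: memory
        nominalQuota: 128Gi
---
# Teams submit jobs to a namespaced LocalQueue that points at the
# ClusterQueue; jobs stay suspended until quota is available.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a
  namespace: default
spec:
  clusterQueue: batch-queue
```

A batch Job opts in by setting the `kueue.x-k8s.io/queue-name: team-a` label and `suspend: true`; Kueue unsuspends it only when the whole job can run, avoiding partial scheduling and resource fragmentation. The other projects listed take different approaches to the same problem (e.g. gang scheduling in Volcano).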
Anish is an engineering manager at Red Hat in the OpenShift AI organization. He works on making machine learning easier for the wider community by building out a platform with cloud capabilities at its core. Most recently, his interests have focused on the Distributed Workloads space and technologies such as KubeRay, Kueue, and CodeFlare. He has previously been heavily invested in areas such as monitoring, scalability, and reliability.