Efficiently Deploying and Benchmarking LLMs in Kubernetes Devconf.US

Efficiently Deploying and Benchmarking LLMs in Kubernetes

2024-08-14 14:20–14:55, Conference Auditorium (capacity 260)

As LLMs gain mainstream adoption in business, operating them efficiently on Kubernetes is becoming an important concern. Optimizing the performance of an LLM service starts with reliably measuring its key runtime performance metrics. In this talk, we will demonstrate how to benchmark LLM performance on Kubernetes with the KServe stack under various inference runtimes. We will cover LLM deployment strategies, load testing across various configurations, and techniques for capturing key performance indicators such as tokens per second, time per output token, and time to first token. We will also show how to capture relevant resource consumption metrics, such as GPU utilization and GPU memory consumption, to aid in performance bottleneck analysis. Combined with evaluation metrics for LLMs, these runtime performance metrics can be an extremely useful tool for optimizing LLM services running in a production environment.
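As a rough illustration of the latency metrics named above, the sketch below derives time to first token, time per output token, and tokens per second from the arrival timestamps of a single streamed response. It is not taken from the talk: the function name, the `LatencyMetrics` type, and the exact metric definitions (for example, averaging inter-token time over the tokens after the first) are assumptions chosen for illustration, and real benchmarking tools may define these quantities differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencyMetrics:
    """Per-request latency metrics, all in seconds or tokens/second."""
    time_to_first_token: float
    time_per_output_token: float
    tokens_per_second: float


def compute_latency_metrics(request_start: float,
                            token_times: Sequence[float]) -> LatencyMetrics:
    """Derive latency metrics from token arrival timestamps.

    Assumed definitions (hypothetical, for illustration only):
      - time to first token: first token arrival minus request start
      - time per output token: mean gap between tokens after the first
      - tokens per second: output tokens divided by total request time
    """
    if not token_times:
        raise ValueError("token_times must contain at least one timestamp")

    n_tokens = len(token_times)
    first, last = token_times[0], token_times[-1]

    ttft = first - request_start
    # With a single token there are no inter-token gaps to average.
    tpot = (last - first) / (n_tokens - 1) if n_tokens > 1 else 0.0
    total = last - request_start
    tps = n_tokens / total if total > 0 else float("inf")

    return LatencyMetrics(ttft, tpot, tps)


if __name__ == "__main__":
    # Five tokens arriving 0.25 s apart, the first one 0.5 s after the request.
    metrics = compute_latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
    print(metrics)
```

In a load test, values like these would typically be collected per request and then summarized across requests (for example as means and percentiles) alongside GPU utilization and memory figures.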

Nikhil Palaskar

Nikhil is a Senior Software Engineer and a member of Performance and Scale Engineering at Red Hat. His current focus is on serving LLMs and optimizing the performance of their deployments in Kubernetes/OpenShift environments.
