PyCon Hong Kong 2025

Demystify vLLM: introducing the de-facto LLM inference engine for private AI
2025-10-11, Track A (Sessions) (LT-15)
Language: English

With the rise of public cloud services, most people began using LLMs online. However, in August 2025, hundreds of thousands of chats were leaked and surfaced in Google search results, raising serious concerns about whether it’s still safe to share confidential information with online LLMs. As a result, a growing number of users are looking to deploy their own models—either in home labs or enterprise environments—to maintain full control.

Ollama has been the most popular self-hosted LLM solution, but it isn’t designed for large-scale deployments. In contrast, vLLM has recently emerged as a de facto standard for high-performance LLM inference serving.

In this talk, we’ll compare vLLM and Ollama and highlight the advantages of vLLM. We’ll also explore the inference optimization techniques behind vLLM that reduce GPU compute and memory usage. Finally, we’ll demonstrate how to use and configure vLLM within a Python script.
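
As a taste of what the demo covers, below is a minimal sketch of vLLM’s offline (in-process) inference API. The model name is an illustrative placeholder, and serving options such as GPU memory utilization are left at their defaults; the actual demo may use a different model and configuration.

    from vllm import LLM, SamplingParams

    # A small batch of prompts; vLLM schedules them together on the GPU.
    prompts = [
        "Explain what PagedAttention does in one sentence.",
        "Why might an enterprise self-host an LLM?",
    ]

    # Sampling configuration: temperature, nucleus sampling, and an output length cap.
    sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

    # Load a model; any Hugging Face-compatible model id that fits on your GPU
    # should work here (the id below is just an example).
    llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

    # Generate completions for all prompts in a single batched call.
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(output.prompt)
        print(output.outputs[0].text)

The same engine can also be exposed as an OpenAI-compatible HTTP server, which is the more common setup for enterprise deployments.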

Peter Ho is a Senior Solution Architect at Red Hat. He has extensive experience working with over 30 enterprise and government customers to establish resilient hybrid and multi-cloud architectures for their mission-critical applications. In recent years, Peter has championed LLM serving solutions, helping clients integrate and deploy vLLM.