PyCon Lithuania 2024

Speed up open source LLM-serving with llama-cpp-python
2024-04-05, Room 111

Large language models (LLMs) often require huge compute resources to serve. This is a common challenge for those who want to avoid sharing their data with cloud API providers or who need to deploy their stack in air-gapped environments. We will take a look at how the open source llama-cpp-python library opens the door to lower hardware requirements and significantly simpler deployment.


This talk takes a deep look at the problem of serving large language models (LLMs) without massive hardware requirements and complexity. LLMs often require tens or even hundreds of gigabytes of memory, and serving these models as-is may require complex multi-GPU systems. We will take a look at how the open source llama-cpp-python library 1) reduces the memory requirement, often by 8x, using quantization methods; 2) simplifies deployment with fewer components and dependencies; and 3) can be used by people like you and me!
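As a taste of what the talk covers, here is a minimal sketch of loading a quantized model with llama-cpp-python and running a completion on a laptop-class machine. The model path and sampling parameters are illustrative assumptions, not taken from the talk:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Load a 4-bit quantized GGUF model. A 7B-parameter model that needs
# ~28 GB in float32 fits in roughly 4 GB at 4-bit precision (~8x smaller).
llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # hypothetical local path
    n_ctx=2048,      # context window size
    n_gpu_layers=0,  # 0 = pure CPU inference; raise to offload layers to a GPU
)

# Run a single completion; the whole serving "stack" is this one process.
output = llm(
    "Q: Name the planets in the solar system. A:",
    max_tokens=64,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```

For serving over HTTP, the library also ships an OpenAI-compatible server (installed via the `server` extra) that can be started with `python -m llama_cpp.server --model <path>`, keeping the deployment to a single component with no external dependencies.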

At Clarifai, I lead applied research efforts for our EMEA team, scaling AI for production workloads. My team has been solving search/ranking, retrieval, and multimodal problems. Previously, I led the development of custom ML solutions for enterprise customers.
