PyCon DE & PyData 2025

LLM Inference Arithmetics: the Theory behind Model Serving
2025-04-23, Titanium3

Have you ever asked yourself how parameters for an LLM are counted, or wondered why Gemma 2B is actually closer to a 3B model? You have no clue about what a KV-Cache is? (And, before you ask: no, it's not a Redis fork.) Do you want to find out how much GPU VRAM you need to run your model smoothly?

If your answer to any of these questions was "yes", or you have another question about LLM inference - such as batching or time-to-first-token - this talk is for you. Well, except for the Redis part.


The talk covers the theory necessary to understand how to serve LLMs, walking through the math behind transformer inference in an accessible and light way. By the end of the talk, attendees will learn:

  1. How to count the parameters in an LLM, especially the ones in the attention layers.
  2. The difference between compute and memory in the context of LLM inference.
  3. That LLM inference is made up of two parts: prefill and decoding.
  4. What an LLM server is, and which features it implements to optimise GPU memory usage and reduce latency.
  5. How batching affects your inference metrics, like time-to-first-token.

The talk will cover:

Did you pay attention? (4 min). A short review of the attention mechanism and how to count parameters in a transformer-based model.
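
To make the parameter counting concrete, here is a minimal sketch (an illustration, not material from the talk) that counts the weights in one standard multi-head attention block, assuming four square Q/K/V/O projections of shape (d_model, d_model); the hidden size 4096 is a hypothetical example:

```python
def attention_params(d_model: int, bias: bool = False) -> int:
    """Parameters in one standard multi-head attention block:
    four projections (Q, K, V, O), each of shape (d_model, d_model)."""
    per_projection = d_model * d_model + (d_model if bias else 0)
    return 4 * per_projection

# Hypothetical example: a model with hidden size 4096
print(attention_params(4096))  # 67108864 -> ~67M params per attention block
```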

Get to know your params (8 min). The math-y section of the talk, explaining how to translate parameter counts into memory and compute requirements.
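
As a rough preview of that arithmetic: the VRAM needed just to hold the weights is approximately the parameter count times the bytes per parameter, e.g. about 14 GB for a 7B-parameter model in fp16. A toy sketch:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM for the weights alone
    (ignores KV-cache, activations, and framework overhead)."""
    return n_params * bytes_per_param / 1e9

# Hypothetical example: a 7B model in fp16 (2 bytes per parameter)
print(f"{weight_memory_gb(7e9):.1f} GB")  # ~14.0 GB
```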

Prefill and Decoding (8 min). Explains that inference happens in two phases (prefill and decoding) and how the KV-cache exploits this split to make decoding faster. Also introduces common metrics for measuring inference performance, like time-to-first-token and tokens-per-second.
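
For a flavour of why the KV-cache dominates memory at long context, here is a rough size estimate, assuming K and V are cached per layer with shape (batch, n_kv_heads, seq_len, head_dim); the Llama-2-7B-like configuration is an illustrative assumption, not a figure from the talk:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_val: int = 2) -> float:
    """KV-cache size: 2 tensors (K and V) per layer, each of shape
    (batch, n_kv_heads, seq_len, head_dim)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_val / 1e9

# Hypothetical Llama-2-7B-like config: 32 layers, 32 KV heads, head_dim 128
print(f"{kv_cache_gb(32, 32, 128, seq_len=4096, batch=1):.2f} GB")  # ~2.15 GB
```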

Context and batch size (5 min). Adds sequence length to the picture, as well as the number of requests processed in parallel. Explains how LLM servers like vLLM use techniques such as PagedAttention to optimise GPU usage.
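
For intuition only, a toy sketch of the block-table idea behind PagedAttention (a deliberate simplification, not vLLM's actual implementation): the KV-cache is carved into fixed-size blocks, and a sequence's logical blocks are mapped to physical GPU blocks allocated on demand, rather than reserving the full context window up front. All names and sizes here are hypothetical:

```python
BLOCK_SIZE = 16  # tokens per KV block (hypothetical)

free_blocks = list(range(1024))          # pool of physical KV blocks on the GPU
block_tables: dict[int, list[int]] = {}  # seq_id -> list of physical block ids

def append_token(seq_id: int, seq_len: int) -> None:
    """Allocate a new physical block only when a sequence crosses a block boundary."""
    table = block_tables.setdefault(seq_id, [])
    if seq_len % BLOCK_SIZE == 1:  # first token landing in a fresh block
        table.append(free_blocks.pop())

# A 34-token sequence only needs 3 blocks (48 token slots), not the whole window
for t in range(1, 35):
    append_token(seq_id=0, seq_len=t)
print(len(block_tables[0]))  # 3
```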

Conclusion (5 min). Wrap-up and Q&A.


Expected audience expertise (Domain): Novice

Expected audience expertise (Python): None

AI Engineer at xtream by day, and open-source maintainer by night. I strive to be an active part of the Python and PyData communities - e.g. as an organiser of PyData Milan. Feel free to reach out!