Surprisal and the headache of tokenizer encodings in LLMs! PyCon Lithuania 2025

2025-04-25 12:00–12:25, 228

What can go wrong with tokenizer encodings? Everything! I will share my experience of understanding, misunderstanding, and ultimately learning to work with tokenization in LLMs. I will discuss what surprisal is, its relevance to my research, and its connection to tokenization. The talk will include various examples illustrating how misunderstandings of tokenization can arise, as well as strategies for debugging and preventing these issues.
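As a rough illustration of the connection between surprisal and tokenization, here is a minimal sketch. It assumes the standard definition of surprisal as the negative log probability of an item given its context, and the common practice of summing subword surprisals to get a word-level value. The function names, the example split, and the probabilities are hypothetical and are not taken from the talk.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Surprisal of an event with probability `prob`, in bits: -log2(prob)."""
    return -math.log2(prob)


def word_surprisal(token_probs: list[float]) -> float:
    """Word-level surprisal as the sum of its subword-token surprisals.

    Each entry is the conditional probability of one subword token given
    everything before it, so the sum equals -log2 of their product.
    """
    return sum(surprisal_bits(p) for p in token_probs)


# Hypothetical example: a tokenizer splits one word into two subword tokens,
# and a model assigns them made-up conditional probabilities 0.25 and 0.5.
probs = [0.25, 0.5]
print(word_surprisal(probs))  # 2.0 bits + 1.0 bit = 3.0
```

If the per-token values are not mapped back to the right words (for example, when a tokenizer attaches the leading space to a token), the word-level sums come out wrong, which is the kind of misunderstanding the abstract alludes to.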

Siddharth Gupta

Computational Cognitive Science researcher at the University of Potsdam, Potsdam, Germany

