Safely Batching Tokenization Merges PyConES 2025

Safely Batching Tokenization Merges
19/10/2025 16:40–17:20, Track 05 - E05, A02
Language: English

Safely batching the merges used to build a tokenization vocabulary yields a speedup of two to three orders of magnitude, depending on the target vocabulary size. This safe batching makes it possible to process billions of tokens and generate new token vocabularies in minutes on a basic laptop, without changing the resulting tokenization.
When building a tokenization vocabulary for an LLM or a compression algorithm, the standard approach is to count all consecutive token pairs, merge the most common pair into a new token, then repeat the process until you reach the desired vocabulary size. With a large dataset, that is an enormous amount of work for a single token merge. I outline the three key insights that let you safely process larger and larger batches of token merges.
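The standard one-merge-per-pass loop described above can be sketched as follows. This is a minimal illustration and not the speaker's implementation: the function names (`count_pairs`, `merge_pair`, `build_vocabulary`) are hypothetical, and starting from raw UTF-8 bytes with new token ids beginning at 256 is an assumption made for the example.

```python
from collections import Counter


def count_pairs(tokens):
    """Count every pair of consecutive tokens in the sequence."""
    return Counter(zip(tokens, tokens[1:]))


def merge_pair(tokens, pair, new_token):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_token)
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def build_vocabulary(text, vocab_size):
    """Naive vocabulary building: one full recount and one merge per pass."""
    # Assumption for this sketch: the base vocabulary is the 256 byte values.
    tokens = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new token id
    next_id = 256
    while next_id < vocab_size:
        counts = count_pairs(tokens)  # full pass over the data
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so nothing is worth merging
        merges[pair] = next_id
        tokens = merge_pair(tokens, pair, next_id)  # second full pass
        next_id += 1
    return merges, tokens


merges, tokens = build_vocabulary("aaabdaaabac", vocab_size=259)
print(merges)
print(tokens)
```

Every new token costs a full recount and a full rewrite of the data, which is why the per-merge cost grows so quickly with dataset size.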
Tokenization vocabularies are not built very often. However, my open-source, pure-Python solution aims to make it easier for anyone to try out new tokenization ideas. The tokenization step of LLM training is often derided as an annoyance that AI researchers only put up with because no other data representation works as well. Because of this, there are still lots of overlooked "easy wins" in this foundational step of LLM training. I conclude the talk by showing how batched vocabulary building can be combined with other tokenization research and reduced training runs to empirically improve LLM performance through tokenization changes alone.

Topic: Machine Learning and Artificial Intelligence (ML, deep learning, AI ethics, generative models...)
Additional topics: Data Science and Data Engineering (analysis, visualization, pipelines, data engineering, notebooks...)
Proposal level: Basic (no prior knowledge required)

Alexander Morgan

Alexander Morgan started out with Python by writing music analysis software for academic projects. He lives in Barcelona and works at Datamaran as a Python and Postgres developer. He is increasingly interested in using Python to validate algorithms and ideas, as well as in simplified approaches to web development.
