Safely Batching Tokenization Merges
19/10/2025, Track 02 - E04, A01
Language: English

Safely batching the merges used to build a tokenization vocabulary yields a speedup of two to three orders of magnitude, depending on the target vocabulary size. This makes it possible to process billions of tokens and generate new token vocabularies in minutes on a basic laptop, without changing the resulting tokenization.
When building a tokenization vocabulary for an LLM or a compression algorithm, the standard approach is to count all consecutive token pairs, merge the most common pair into a new token, then repeat the process until you reach the desired vocabulary size. With a large dataset, that is an enormous amount of work for a single token merge. I outline the three key insights that let you safely process larger and larger batches of token merges.
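For reference, the standard one-merge-at-a-time loop described above can be sketched as follows. This is only a minimal illustration of the unbatched baseline, not the batched method from the talk; the function names (`count_pairs`, `merge_pair`, `build_vocabulary`) and the choice to start from raw UTF-8 bytes are assumptions made for the example.

```python
from collections import Counter


def count_pairs(tokens):
    """Count every pair of adjacent tokens in the sequence."""
    return Counter(zip(tokens, tokens[1:]))


def merge_pair(tokens, pair, new_token):
    """Replace each occurrence of `pair` with `new_token`, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_token)
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def build_vocabulary(text, vocab_size):
    """Naive baseline: one full recount and one full rewrite per merge.

    Assumption for this sketch: the base vocabulary is the 256 byte values.
    """
    tokens = list(text.encode("utf-8"))
    merges = {}  # (left, right) -> new token id
    next_id = 256
    while next_id < vocab_size:
        counts = count_pairs(tokens)  # recounts the whole dataset every time
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair, _ = counts.most_common(1)[0]
        merges[pair] = next_id
        tokens = merge_pair(tokens, pair, next_id)
        next_id += 1
    return merges, tokens


merges, tokens = build_vocabulary("aaabdaaabac", vocab_size=259)
```

Each pass through the loop touches the entire token sequence twice, once to count and once to rewrite, which is why a single merge is so costly on a large dataset.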
Tokenization vocabularies are not built very often. However, my open-source, pure-Python solution aims to make it easier for anyone to try out new tokenization ideas. The tokenization step of LLM training is often derided as an annoyance that AI researchers only put up with because no other data representation works as well. Because of this, there are still lots of overlooked "easy wins" in this foundational step of LLM training. I conclude my talk by showing how a batched approach to vocabulary building can be combined with other tokenization research and reduced training runs to empirically improve LLM performance through tokenization changes alone.


Topic:

Machine Learning and Artificial Intelligence (ML, deep learning, AI ethics, generative models...)

Additional topics:

Data Science and Data Engineering (analytics, visualization, pipelines, data engineering, notebooks...)

Proposal level:

Basic (no previous knowledge is necessary)

Alexander Morgan started out with Python by writing music analysis software for academic projects. He lives in Barcelona and works at Datamaran as a Python and Postgres developer. He is increasingly interested in using Python to validate algorithms and ideas, as well as in simplified approaches to web development.