PyCon DE & PyData 2026

When LLMs Are Too Big: Building Cost-Efficient High-Throughput ML Systems for E-Commerce Cataloging
Palladium [2nd Floor]

E-commerce cataloging at idealo operates at extreme scale: 4.5 billion offers from 50,000+ shops across six countries, with peak ingestion rates of 4.8 million offers per minute. While large language models (LLMs) provide strong classification accuracy, they are too slow and costly for billion-scale real-time processing. This talk shows how idealo builds a cost-efficient, high-throughput machine learning system that leverages LLM knowledge without deploying full models in production.

We present how knowledge distillation from a large e5 instruction model enables a compact multilingual MiniLM encoder to achieve high accuracy, and how optimized inference runtimes and specialized hardware such as AWS Neuron help meet strict latency and cost requirements. Beyond modeling, we highlight key operational challenges: constructing training datasets from massively imbalanced data, selecting the right encoder architecture from today’s model landscape, and designing a robust MLOps lifecycle with automated data sampling, training, deployment, and monitoring.

Attendees will learn practical techniques for scaling ML systems under real-world constraints, how to extract value from LLMs when they are too large to serve directly, and how to transition research prototypes into reliable, high-volume production pipelines.


When LLMs Are Too Big: Building Cost-Efficient High-Throughput Machine Learning for Cataloging in E-Commerce

idealo.de offers a price comparison service for over 5.7 million products across thousands of categories. It operates in a constantly changing, billion-scale landscape with over 2 billion offers from 50,000+ shops in 6 countries. Our central challenge is cataloging these offers automatically at scale, with a peak throughput of 4.8 million offers per minute.
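
To make the peak rate more tangible, the following minimal sketch converts it to a per-second figure and estimates how many inference instances would be needed to cover it. The per-instance throughput is a made-up placeholder chosen only to illustrate the arithmetic; it is not a figure from the talk.

```python
import math

PEAK_OFFERS_PER_MINUTE = 4_800_000  # peak rate stated in the description above


def offers_per_second(offers_per_minute: int) -> float:
    """Convert a per-minute rate into a per-second rate."""
    return offers_per_minute / 60


def instances_needed(peak_per_second: float, per_instance_per_second: float) -> int:
    """Smallest number of instances whose combined throughput covers the peak."""
    return math.ceil(peak_per_second / per_instance_per_second)


peak = offers_per_second(PEAK_OFFERS_PER_MINUTE)
print(peak)  # 80000.0

# Hypothetical per-instance throughput, used only to make the arithmetic concrete.
print(instances_needed(peak, per_instance_per_second=2_500))  # 32
```

Every gain in per-instance throughput directly reduces the fleet size, which is why model size and runtime efficiency dominate cost at this scale.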

While modern large language models (LLMs) excel at such tasks, they do not scale well to this volume of data. To meet business needs, we must balance processing speed against cataloging quality. By applying modern machine learning techniques that distill specialist knowledge from state-of-the-art LLMs into smaller models, combined with a range of performance optimizations, we speed up idealo’s processing while substantially improving cataloging quality. This talk presents how these solutions balance cost and performance and how they integrate into idealo’s offer cataloging pipelines.

What makes this approach unique?

We present our solution and practical experience in high-throughput classification. This includes the operational side of our system, in particular the design of a stable, high-performance MLOps lifecycle integrated into our CI/CD and continuous training pipelines, in which we automate continuous data sampling, model training, model deployment, and monitoring.
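
As a rough illustration of an automated data sampling step on heavily imbalanced data, the sketch below caps the number of examples per category so that frequent categories do not drown out rare ones. The function name, the cap, and the toy labels are assumptions made for illustration; the sampling strategy actually used at idealo is not described here.

```python
import random
from collections import defaultdict


def sample_capped_per_class(examples, max_per_class, seed=0):
    """Keep at most `max_per_class` randomly chosen examples for each label.

    `examples` is an iterable of (text, label) pairs. Rare classes are kept
    in full; frequent classes are downsampled to the cap.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    sampled = []
    for label in sorted(by_label):  # sorted for reproducible ordering
        items = by_label[label]
        if len(items) > max_per_class:
            items = rng.sample(items, max_per_class)
        sampled.extend(items)
    rng.shuffle(sampled)
    return sampled


# Toy data: one dominant category and one rare category (made-up labels).
offers = [(f"phone offer {i}", "smartphones") for i in range(1000)]
offers += [(f"harp offer {i}", "harps") for i in range(5)]

subset = sample_capped_per_class(offers, max_per_class=50)
counts = defaultdict(int)
for _, label in subset:
    counts[label] += 1
print(dict(sorted(counts.items())))  # {'harps': 5, 'smartphones': 50}
```

Fixing the seed keeps the sample reproducible across pipeline runs, which makes it easier to compare models trained on successive data snapshots.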

We discuss concrete solutions and best practices, showing how knowledge distillation from a large e5 instruction transformer improves the accuracy of our multilingual MiniLM transformer encoder. Additionally, we show how running these models on specialized hardware such as AWS Neuron allows us to meet strict runtime and latency requirements cost-efficiently.
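
The idea behind knowledge distillation can be sketched with a common logit-based formulation: the student is trained to match the teacher's temperature-softened class probabilities while still fitting the ground-truth label. This is a minimal pure-Python sketch under that assumption; the temperature, the weighting factor alpha, and the toy logits are illustrative, and the exact distillation objective used for the e5-to-MiniLM setup may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Blend of soft-target KL divergence and hard-label cross-entropy.

    alpha weights the teacher term; (1 - alpha) weights the ground-truth term.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # Standard cross-entropy against the ground-truth category.
    ce = -math.log(softmax(student_logits)[true_label])
    # The T^2 factor keeps the soft-target term on a comparable scale across temperatures.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


teacher = [4.0, 1.0, 0.5]        # toy teacher logits for three categories
good_student = [3.5, 1.2, 0.4]   # agrees with the teacher
poor_student = [0.5, 1.0, 4.0]   # disagrees with the teacher

print(distillation_loss(good_student, teacher, true_label=0)
      < distillation_loss(poor_student, teacher, true_label=0))  # True
```

In a PyTorch training loop the same quantity would be computed on tensors so that gradients flow into the student, while the teacher stays frozen and is only needed at training time.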

In detail, we will discuss the following topics:

  • The machine learning operations (MLOps) lifecycle for a high-throughput category classification system.
  • Challenges in efficiently creating training and test datasets from a huge amount of massively unbalanced data.
  • Selecting the right model from the current zoo of encoder language models.
  • Using knowledge distillation via student-teacher models to balance required compute and classification performance.
  • Integrating quantization techniques for speed improvements.
  • Selecting ideal compute instances for our production environment.
  • Compiling the model for custom-designed machine learning accelerators using the Neuron package.
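
To illustrate the quantization topic from the list above, here is a minimal sketch of symmetric per-tensor int8 quantization, one of the simplest schemes. The function names and weight values are hypothetical, and the sketch makes no claim about which quantization technique the production system uses.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float values."""
    return [v * scale for v in q]


weights = [0.82, -1.27, 0.003, 0.45, -0.6]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(max_error <= scale / 2 + 1e-12)
```

Storing weights as 8-bit integers reduces memory traffic and allows faster integer arithmetic on supporting hardware, at the price of the small rounding error measured above.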

Key takeaways for attendees:

  • An overview of months of research and exploration on massive-throughput environments, including their practical integration into live systems.
  • Modern machine learning systems in production, especially with billion-scale data, need to carefully balance business needs in terms of cost and quality.
  • State-of-the-art LLMs are often not feasible for large-scale tasks. However, new machine learning techniques can extract their knowledge for specific applications.
  • How to transition research findings to production.

The talk is structured around our tech stack, which includes PyTorch, PyTorch Lightning, Hugging Face, AWS SageMaker, AWS Neuron SDK, Grafana Loki, Docker, and GitHub Actions.


Expected audience expertise in your talk's domain: Advanced
Expected audience expertise in Python: Intermediate

Tobias Senst is a Senior Machine Learning Engineer at idealo internet GmbH. He received his PhD in 2019 from the Technische Universität Berlin under the supervision of Prof. Thomas Sikora and has more than 10 years of experience in Computer Vision and Video Analytics research.

At idealo, he switched from the world of images and videos to Natural Language Processing and is responsible for operating and developing machine learning models in a production environment.

Bastian is a Senior Machine Learning Research Engineer at idealo internet GmbH, where he focuses on large-scale offer cataloging and high-throughput machine learning systems. Before joining idealo in 2025, he was an Assistant Professor at Linköping University in Sweden, leading a research group in 3D computer vision.

He completed his PhD in 2020 at Leibniz University Hannover with a thesis on 3D human pose estimation and subsequently spent two years as a postdoctoral researcher at the University of British Columbia in Canada, expanding his research into broader areas of 3D computer vision and teaching related courses.