PyCon DE & PyData 2026

How to Search Through 800 Billion Records in Real Time
Ferrum [2nd Floor]

Large-scale distributed systems rarely produce clean data streams. In practice, hundreds of services continuously emit overlapping updates, retries, corrections, and partial state. Turning that constant stream of noisy events into a reliable, searchable dataset in real time, while processing hundreds of billions of records per day, requires careful architectural choices.

This talk shares practical lessons from building a Kafka-based ETL pipeline that transforms massive volumes of events into a coherent dataset suitable for real-time search. After a brief overview of the system architecture, we focus on several key techniques: reducing redundant processing through key deduplication and short-lived buffers, defining when messages can be safely acknowledged without risking data loss, and keeping long-running ETL services healthy under heavy Kafka workloads.
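The deduplication idea mentioned above can be sketched in a few lines. This is a hypothetical illustration, not the speaker's implementation: a short-lived buffer keyed by message key, where a newer event simply overwrites an older one for the same key, so downstream work happens once per key per flush window instead of once per event. The class name, flush-interval parameter, and (key, value) message shape are all assumptions made for the example.

```python
import time


class DedupBuffer:
    """Short-lived buffer that collapses repeated updates to the same key.

    Sketch only: messages are assumed to be (key, value) pairs, and only
    the latest value per key within the flush window is kept, reducing
    redundant downstream processing.
    """

    def __init__(self, flush_interval=1.0):
        self.flush_interval = flush_interval
        self._buffer = {}  # key -> latest value seen in this window
        self._last_flush = time.monotonic()

    def add(self, key, value):
        # A newer event for the same key overwrites the older duplicate.
        self._buffer[key] = value

    def should_flush(self):
        return time.monotonic() - self._last_flush >= self.flush_interval

    def flush(self):
        # Hand the collapsed batch downstream and start a fresh window.
        batch, self._buffer = self._buffer, {}
        self._last_flush = time.monotonic()
        return batch


buf = DedupBuffer(flush_interval=0.0)
for key, value in [("a", 1), ("b", 2), ("a", 3)]:
    buf.add(key, value)
batch = buf.flush()  # three events collapse to two units of work
```

In this toy run, `batch` holds only the latest value per key. The same window boundary is also a natural point to acknowledge consumed messages: committing offsets only after a flushed batch has been durably processed is one way to avoid acknowledging data that could still be lost.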

The session emphasizes concrete engineering trade-offs and operational realities rather than theory. Attendees will leave with practical patterns for building more reliable and efficient streaming pipelines.


Expected audience expertise in your talk's domain: Intermediate
Expected audience expertise in Python: Intermediate

Principal Software Engineer at ReversingLabs, working on large-scale distributed systems and data-intensive architectures.

I design and operate high-throughput, real-time pipelines, with an emphasis on reliability, observability, and performance in real-world conditions, and a practical approach to engineering trade-offs and system failures.

Software Development Manager at ReversingLabs, leading teams responsible for large-scale data processing, data quality, and technical writing. Specialized in turning complex systems into something that works, produces correct results, and is documented well enough that someone else can understand it, usually in that order.