PyCon DE & PyData 2026

Filip Bacic

Software Development Manager at ReversingLabs, leading teams responsible for large-scale data processing, data quality, and technical writing. Specializes in turning complex systems into something that works, produces correct results, and is documented well enough that someone else can understand it, usually in that order.


Session

04-14
16:30
30min
How to Search Through 800 Billion Records in Real Time
Mirano Tuk, Filip Bacic

Large-scale distributed systems rarely produce clean data streams. In practice, hundreds of services continuously emit overlapping updates, retries, corrections, and partial state. Turning that constant stream of noisy events into a reliable, searchable dataset in real time, while processing hundreds of billions of records per day, requires careful architectural choices.

This talk shares practical lessons from building a Kafka-based ETL pipeline that transforms massive volumes of events into a coherent dataset suitable for real-time search. After a brief overview of the system architecture, we focus on several key techniques: reducing redundant processing through key deduplication and short-lived buffers, defining when messages can be safely acknowledged without risking data loss, and keeping long-running ETL services healthy under heavy Kafka workloads.
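As a rough illustration of two of the techniques named above (key deduplication in a short-lived buffer, and acknowledging messages only once it is safe), here is a minimal sketch in plain Python. It is not the speakers' implementation: the `DedupBuffer` class, its `flush` callback, the `max_keys` and `max_age_s` limits, and the single-partition offset handling are all assumptions made for the example, and no Kafka client is involved.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class DedupBuffer:
    """Hypothetical short-lived buffer that keeps only the latest value per key.

    Assumes messages come from a single partition, so offsets are ordered.
    """

    flush: Callable[[Dict[str, dict]], None]  # hypothetical downstream sink
    max_keys: int = 1000                      # flush when this many distinct keys are held
    max_age_s: float = 1.0                    # flush when the buffer is this old
    committed_offset: int = field(default=-1, init=False)
    _latest: Dict[str, dict] = field(default_factory=dict, init=False)
    _highest_offset: int = field(default=-1, init=False)
    _opened_at: Optional[float] = field(default=None, init=False)

    def add(self, key: str, value: dict, offset: int, now: float) -> None:
        if self._opened_at is None:
            self._opened_at = now
        # A later update for the same key overwrites the earlier one,
        # so redundant intermediate states are never processed downstream.
        self._latest[key] = value
        self._highest_offset = max(self._highest_offset, offset)
        if len(self._latest) >= self.max_keys or now - self._opened_at >= self.max_age_s:
            self.drain()

    def drain(self) -> None:
        if not self._latest:
            return
        # If flush raises, the offset is not advanced and the buffer is kept,
        # so nothing is acknowledged before it has been written out.
        self.flush(dict(self._latest))
        self.committed_offset = self._highest_offset
        self._latest.clear()
        self._opened_at = None
```

In this sketch, `committed_offset` stands in for whatever offset a real consumer would commit back to the broker. Because it only moves after a successful flush, a crash between receiving and flushing leads to reprocessing rather than data loss, at the cost of at-least-once delivery downstream.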

The session emphasizes concrete engineering trade-offs and operational realities rather than theory. Attendees will leave with practical patterns for building more reliable and efficient streaming pipelines.

PyData: Data Handling & Data Engineering
Ferrum [2nd Floor]