2026-06-06, Doddington Forum
Adopting a streaming architecture as a Python developer often means abandoning the tools and abstractions you know (DataFrames, batch processing, familiar data workflows) in favour of an entirely different mental model. After ten years of tackling this problem across multiple companies, I've learned it doesn't have to be that way.
In this talk, I'll show how to treat Kafka not as a stream of individual messages but as a source of micro-batches, and how to deserialize those messages, whether JSON or Protobuf, into Arrow-backed DataFrames. The result: your processing code looks the same whether the data comes from a Parquet file or a Kafka topic.
No heavy framework required. Using confluent-kafka and Apache Arrow, I'll walk through how to build this from the ground up, so you understand every layer of the stack.
The talk opens with a concrete example of stream processing. We have data flowing in, and a clear task to perform on it. No theory, no definitions, just a practical scenario the audience can immediately relate to.
From there, we step back and look at how Kafka works. Topics, consumers, partitions, message formats. Just enough to understand the architecture behind the example, and to appreciate why Kafka has become the standard backbone for streaming systems.
Then comes the friction. When you consume from Kafka, you get one message at a time. Each message is serialized as JSON or Protobuf. If you're a Python developer used to working with DataFrames, this feels like going back to writing for loops over rows. We'll look at what the naive approach looks like in code, and why it quickly becomes painful as processing logic gets more complex.
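As a rough illustration of that naive approach, the sketch below sums an amount per user, one JSON message at a time. The field names `user_id` and `amount` and the sample payloads are hypothetical, chosen only for this example. The Kafka consumer is left out so the snippet runs on its own.

```python
import json
from collections import defaultdict


def total_per_user_naive(payloads):
    """Sum `amount` per `user_id`, handling one JSON message at a time."""
    totals = defaultdict(float)
    for raw in payloads:              # one iteration per Kafka message
        record = json.loads(raw)      # deserialize a single message
        totals[record["user_id"]] += record["amount"]
    return dict(totals)


# Hypothetical payloads. In a real consumer these bytes would come from
# repeated `consumer.poll(timeout)` calls, via `msg.value()`.
sample = [
    b'{"user_id": "a", "amount": 1.5}',
    b'{"user_id": "b", "amount": 2.0}',
    b'{"user_id": "a", "amount": 2.5}',
]

print(total_per_user_naive(sample))
```

Every new aggregation, filter, or join adds more hand-written bookkeeping inside the loop, which is where the pain comes from as the logic grows.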
With the problem clearly felt, we introduce the solution: treating Kafka not as a stream of individual messages but as a source of micro-batches, and deserializing those batches directly into Arrow-backed DataFrames using confluent-kafka and Apache Arrow. The processing code that follows looks identical to what you'd write against a Parquet file. We'll see both versions side by side to make this concrete.
We close with lessons learned from applying this pattern in production over ten years. What breaks, what surprises you, and what trade-offs you should be aware of before adopting this approach in your own systems.
The talk assumes familiarity with Python and basic data processing with DataFrames. No prior knowledge of Kafka or streaming is required.
A seasoned software engineer working on both batch and real-time, data-intensive Python applications.