Titanium [2nd Floor]
In the daily work of a data engineer, building new data pipelines often takes priority, while maintaining them and ensuring their correctness becomes an afterthought. This focus can quickly turn into a pitfall: failures go undetected, incorrect data silently propagates, and complaints from stakeholders arrive before engineers notice any issues. In practice, incorporating observability into every new data pipeline helps avoid these problems and enables teams to steadily increase system complexity while maintaining trust and peace of mind.
In this talk, I introduce observability in the context of data pipelines, covering its three core pillars: metrics, alarms, and logs. We will explore concepts like the four golden signals, alarm fatigue, and structured logging, and how they apply to data pipelines. I will show easy-to-implement first steps and share real-world experiences in which improved observability helped uncover previously unknown incorrect behavior and build trust in data systems.
This talk is well suited for data engineers who have had little exposure to observability and want to learn strategies for staying sane while managing a jungle of pipelines.
This talk explores how observability can be applied to data pipelines to improve reliability, data quality, and confidence in complex data systems.
The talk begins with an introduction to observability in the context of data engineering. It explains the three core pillars: metrics, alarms, and logs, and discusses why observability is particularly important for data pipelines, where failures are often silent and correctness issues may only surface through stakeholder complaints.
The first section focuses on metrics. It demonstrates how straightforward it can be to instrument data pipelines with basic metrics using Python. The talk then discusses which metrics are worth monitoring, adapting established concepts such as the four golden signals to data engineering use cases. A concrete example based on a near–real-time event processing pipeline illustrates how fine-grained metrics can reveal systematic failures for specific event types.
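As a rough illustration of the kind of instrumentation described above, the following sketch counts outcomes per event type using only the Python standard library. The `PipelineMetrics` class, the `process` function, and the event shapes are hypothetical and are not taken from the talk; a real pipeline would typically send these counters to a metrics backend instead of keeping them in memory.

```python
from collections import Counter


class PipelineMetrics:
    """Hypothetical in-memory counters keyed by (event_type, outcome)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, event_type, outcome):
        self.counts[(event_type, outcome)] += 1

    def error_rate(self, event_type):
        ok = self.counts[(event_type, "ok")]
        failed = self.counts[(event_type, "error")]
        total = ok + failed
        return failed / total if total else 0.0


def process(event):
    # Stand-in for real processing logic (assumption): events without
    # an "amount" field cannot be handled.
    if "amount" not in event:
        raise ValueError("missing amount")
    return event["amount"]


def run(events, metrics):
    for event in events:
        try:
            process(event)
            metrics.record(event["type"], "ok")
        except ValueError:
            # Count the failure instead of letting it pass silently.
            metrics.record(event["type"], "error")


metrics = PipelineMetrics()
events = [
    {"type": "order", "amount": 10},
    {"type": "order", "amount": 25},
    {"type": "refund"},  # systematically malformed event type
    {"type": "refund"},
]
run(events, metrics)

for event_type in ("order", "refund"):
    print(event_type, metrics.error_rate(event_type))
```

Broken down by event type, the counters show that every `refund` event fails while `order` events succeed, which a single pipeline-wide error rate would blur.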
The second section focuses on alerting. It addresses the challenge that engineers rarely have time to continuously inspect dashboards and therefore rely on alarms to surface important issues. The talk outlines what makes a good alarm, emphasizing that alarms should be actionable, reliable, and provide sufficient context for investigation. A scenario with excessive and noisy alarms is used to illustrate alarm fatigue, and a strategy for getting out of such a situation is described.
The final section covers log messages and their importance for reasoning about how a pipeline ended up in a specific state. It discusses why logs are often difficult to work with in data pipelines, as they may contain a mixture of critical errors, informational messages, and low-level framework output. The talk introduces structured logging as a way to add context and make logs easier to search, filter, and aggregate. Examples include monitoring the distribution of log levels to uncover hidden issues and using centralized logging to identify dependencies between pipelines that are otherwise hard to detect.
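One common way to make an alarm less noisy, sketched here as an assumption rather than as the strategy presented in the talk, is to fire only when a threshold is breached for several consecutive evaluation periods and to include context in the message. The `Alarm` dataclass, its fields, and the runbook reference are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    name: str
    threshold: float           # error rate above which a data point counts as a breach
    consecutive_breaches: int  # number of breaching data points required to fire
    runbook: str               # hypothetical pointer to investigation steps


def evaluate(alarm, error_rates):
    """Return an alert message if the most recent data points all breach
    the threshold, otherwise None."""
    recent = error_rates[-alarm.consecutive_breaches:]
    if len(recent) < alarm.consecutive_breaches:
        return None  # not enough data to decide
    if all(rate > alarm.threshold for rate in recent):
        return (
            f"{alarm.name}: error rate above {alarm.threshold:.0%} "
            f"for {alarm.consecutive_breaches} consecutive periods "
            f"(latest: {recent[-1]:.0%}). See: {alarm.runbook}"
        )
    return None


alarm = Alarm(
    name="orders_pipeline_error_rate",
    threshold=0.05,
    consecutive_breaches=3,
    runbook="runbook section 'orders pipeline errors'",
)

single_spike = [0.01, 0.02, 0.30, 0.01]
sustained = [0.01, 0.08, 0.12, 0.20]

print(evaluate(alarm, single_spike))  # a single spike does not fire
print(evaluate(alarm, sustained))     # a sustained breach fires with context
```

Requiring consecutive breaches trades a slower reaction for fewer false alarms; the right window depends on how quickly a given failure needs attention.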
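A minimal sketch of structured logging with Python's standard `logging` module follows, assuming JSON lines as the output format. The `JsonFormatter` class, the logger name, and the `pipeline` and `run_id` fields are illustrative choices, not details from the talk. The last lines show how the distribution of log levels can be aggregated once every log line is machine-readable.

```python
import io
import json
import logging
from collections import Counter


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Context fields passed via `extra=`; None if absent.
            "pipeline": getattr(record, "pipeline", None),
            "run_id": getattr(record, "run_id", None),
        }
        return json.dumps(payload)


# Log to an in-memory stream so the example is self-contained;
# a real setup would write to stdout or a central log store.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("pipeline_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

context = {"pipeline": "orders_daily", "run_id": "run-001"}
logger.info("load started", extra=context)
logger.warning("3 rows dropped: null customer_id", extra=context)
logger.error("write to target table failed", extra=context)

# Structured lines can be parsed, filtered, and aggregated.
records = [json.loads(line) for line in stream.getvalue().splitlines()]
level_counts = Counter(r["level"] for r in records)
print(level_counts)
```

Because every line carries the same context fields, logs from many pipelines can be filtered by `pipeline` or `run_id` in a central store without parsing free-form text.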
The talk concludes by emphasizing how the three pillars of observability build trust in a data pipeline.
Stefan is a data engineer and works at Covestro in a newly established data office. He has four years of experience working on a variety of data platforms, ranging from classic ETL pipelines and data warehousing to near–real-time stream processing. Before moving into data engineering, he completed a PhD in physics, where he fell in love with Python and working with data. Since then, he has always been curious to learn new things and share what he has learned with others.