PyCon DE & PyData 2026

Fight your garbage data: implementation of a pythonic data quality monitoring framework in PySpark
Helium [3rd Floor]

The timeless phrase “garbage in, garbage out” is even more relevant today: the growing use of non-deterministic generative neural networks amplifies the effects of bad data quality. This presentation describes Data Quality Monitor, a tool that brings transparency to data quality and helps drive real improvements.

In the talk, we'll cover what defines a successful data quality monitoring solution and share findings from our initial evaluation of available open-source frameworks. Next, we'll showcase our implementation based on DQX, a lightweight, open-source framework for performing row-level data quality checks programmatically, with business rules organized in manageable YAML files. DQX, originally developed by Databricks Labs, integrates seamlessly with PySpark, making it easy and affordable to run data quality checks within our IoT data lake. Finally, we will discuss the organizational processes and structures required to respond effectively to data quality issues.
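To give a flavour of what such a YAML rules file can look like, here is a short illustrative fragment. The column names are invented, and the keys loosely follow the style shown in DQX's documentation rather than a verified schema:

```yaml
# Illustrative rules file (hypothetical columns; key names loosely
# follow the documented DQX style, not a verified schema).
- name: sensor_id_not_null
  criticality: error
  check:
    function: is_not_null
    arguments:
      col_name: sensor_id
- name: temperature_plausible
  criticality: warn
  check:
    function: is_in_range
    arguments:
      col_name: temperature_c
      min_limit: -40
      max_limit: 300
```

Keeping rules in files like this lets business users adjust thresholds without touching the pipeline code.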


In the talk we share our experience from a project implemented in Q3 2025. We start with the motivation for the project, the stakeholders involved, and their needs. We will then define the criteria for a successful data quality monitoring solution and share findings from our evaluation of existing frameworks, including why popular frameworks like Great Expectations or Soda did not meet our requirements.

Next, we will demonstrate our implementation based on DQX, a lightweight, open-source Python library designed for traceable, row-level data quality checks before and after data is persisted. DQX, developed and maintained by Databricks Labs, lets developers concentrate on the core implementation while business users maintain business rules in YAML files. Furthermore, DQX’s seamless integration with PySpark enables efficient and cost-effective quality monitoring within our IoT data lake.
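To make the row-level, criticality-based checking concrete, here is a minimal framework-agnostic sketch of the pattern: declarative rules (as they might be parsed from a YAML file) applied row by row, with "error" rows quarantined and "warn" rows annotated but kept. The rule keys and check names mirror the style described above, but the tiny engine below is our own illustration with plain Python dicts in place of Spark rows, not the DQX API:

```python
# Sketch of declarative row-level data quality checks (NOT the DQX API).
# Rules would normally be loaded from a YAML file; here they are inlined.
from typing import Callable, Optional

# Check factories: each returns a function mapping a row to an error
# message (failure) or None (pass).
def is_not_null(col: str) -> Callable[[dict], Optional[str]]:
    return lambda row: None if row.get(col) is not None else f"{col} is null"

def is_in_range(col: str, lo: float, hi: float) -> Callable[[dict], Optional[str]]:
    return lambda row: (
        None if row.get(col) is not None and lo <= row[col] <= hi
        else f"{col} not in [{lo}, {hi}]"
    )

# Rules as they might look after parsing a YAML rules file.
RULES = [
    {"name": "sensor_id_not_null", "criticality": "error",
     "check": is_not_null("sensor_id")},
    {"name": "temperature_plausible", "criticality": "warn",
     "check": is_in_range("temperature_c", -40.0, 300.0)},
]

def apply_checks(rows: list) -> tuple:
    """Split rows into good and quarantined; annotate warnings/errors."""
    good, quarantined = [], []
    for row in rows:
        errors = [r["name"] for r in RULES
                  if r["criticality"] == "error" and r["check"](row)]
        warnings = [r["name"] for r in RULES
                    if r["criticality"] == "warn" and r["check"](row)]
        annotated = {**row, "_warnings": warnings}
        if errors:
            quarantined.append({**annotated, "_errors": errors})
        else:
            good.append(annotated)
    return good, quarantined

rows = [
    {"sensor_id": "a1", "temperature_c": 180.0},  # passes all checks
    {"sensor_id": None, "temperature_c": 25.0},   # error: null id
    {"sensor_id": "b2", "temperature_c": 999.0},  # warning only, kept
]
good, bad = apply_checks(rows)
print(len(good), len(bad))  # → 2 1
```

In Spark the same idea is applied as DataFrame transformations rather than a Python loop, but the separation of rules, criticality, and the good/quarantine split is the part that matters organisationally.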

Finally, we move beyond the code to the organisational reality. We will discuss how we embedded Data Quality Monitor into the organisation and share our views on the hard questions: Who is responsible for maintaining rules? Who monitors the results?

Talk outline

  • Motivation for the project

  • Initial situation and objectives

  • Framework evaluation

      ◦ Evaluation criteria for a successful data quality monitoring solution

      ◦ Comparison of available frameworks

  • Our implementation with DQX

      ◦ How to use built-in data quality checks

      ◦ How to add custom data quality checks

      ◦ Automated rule generation with the DQX Profiler

      ◦ Output and visualisation options

      ◦ Python project structure

  • Embedding in the organisation

      ◦ Rule maintenance

      ◦ How to communicate data quality issues

  • Summary

Key takeaways

  • Understanding of the most important criteria for choosing a data quality monitoring framework, from the perspective of a data engineer and an architect

  • Understanding of the DQX framework

  • Ideas for integrating data quality monitoring into an organisation


Expected audience expertise in your talk's domain: Intermediate

Expected audience expertise in Python: Intermediate

Public link to supporting material, e.g. videos, GitHub:

https://github.com/databrickslabs/dqx

Rostislaw, a data architect at RATIONAL AG, specializes in distributed databases, the Apache Hadoop ecosystem and Azure cloud. He leverages his expertise to maintain the enterprise Data & Analytics platform for IoT data, where his daily work involves reconciling diverse stakeholder perspectives to deliver sustainable solutions.

Joshua is a Data Engineer at inovex GmbH dedicated to building robust, scalable data products. Utilizing his foundation as a Full Stack Software Engineer, he applies rigorous software engineering principles to ensure every data solution is high-quality, maintainable, and efficient.