PyCon DE & PyData 2026

Fight your garbage data: implementation of a pythonic data quality monitoring framework in PySpark
Helium [3rd Floor]

The timeless phrase “garbage in, garbage out” is even more relevant today: the growing use of non-deterministic generative neural networks amplifies the effects of bad data quality. This presentation describes Data Quality Monitor, a tool that brings transparency to data quality and helps drive real improvements.

In the talk, we'll cover what defines a successful data quality monitoring solution and share findings from our initial evaluation of available open-source frameworks. Next, we'll showcase our implementation based on DQX, a lightweight, open-source framework for performing row-level data quality checks programmatically, with business rules organized in manageable YAML files. DQX, originally developed by Databricks Labs, integrates seamlessly with PySpark, making it easy and affordable to run data quality checks within our IoT data lake. Finally, we will discuss the organizational processes and structures required to respond effectively to data quality issues.
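To give a flavour of what such a YAML rules file can look like, here is a short illustrative fragment. The column names are invented, and the keys loosely follow the style shown in DQX's documentation rather than a verified schema:

```yaml
# Illustrative rules file (hypothetical columns; key names loosely
# follow the documented DQX style, not a verified schema).
- name: sensor_id_not_null
  criticality: error
  check:
    function: is_not_null
    arguments:
      col_name: sensor_id
- name: temperature_plausible
  criticality: warn
  check:
    function: is_in_range
    arguments:
      col_name: temperature_c
      min_limit: -40
      max_limit: 300
```

Keeping rules in files like this lets business users adjust thresholds without touching the pipeline code.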


In the talk we share our experience from a project implemented in Q3 2025. We start with the motivation for the project, the stakeholders involved, and their needs. We will then define the criteria for a successful data quality monitoring solution and share findings from our evaluation of existing frameworks, including why popular frameworks like Great Expectations or Soda did not meet our requirements.

Next, we will demonstrate our implementation based on DQX, a lightweight, open-source Python library designed for traceable, row-level data quality checks before and after data is persisted. DQX, developed and maintained by Databricks Labs, lets developers concentrate on the core implementation while business users maintain business rules in YAML files. Furthermore, DQX’s seamless integration with PySpark enables efficient and cost-effective quality monitoring within our IoT data lake.
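To make the row-level, criticality-based checking concrete, here is a minimal framework-agnostic sketch of the pattern: declarative rules (as they might be parsed from a YAML file) applied row by row, with "error" rows quarantined and "warn" rows annotated but kept. The rule keys and check names mirror the style described above, but the tiny engine below is our own illustration with plain Python dicts in place of Spark rows, not the DQX API:

```python
# Sketch of declarative row-level data quality checks (NOT the DQX API).
# Rules would normally be loaded from a YAML file; here they are inlined.
from typing import Callable, Optional

# Check factories: each returns a function mapping a row to an error
# message (failure) or None (pass).
def is_not_null(col: str) -> Callable[[dict], Optional[str]]:
    return lambda row: None if row.get(col) is not None else f"{col} is null"

def is_in_range(col: str, lo: float, hi: float) -> Callable[[dict], Optional[str]]:
    return lambda row: (
        None if row.get(col) is not None and lo <= row[col] <= hi
        else f"{col} not in [{lo}, {hi}]"
    )

# Rules as they might look after parsing a YAML rules file.
RULES = [
    {"name": "sensor_id_not_null", "criticality": "error",
     "check": is_not_null("sensor_id")},
    {"name": "temperature_plausible", "criticality": "warn",
     "check": is_in_range("temperature_c", -40.0, 300.0)},
]

def apply_checks(rows: list) -> tuple:
    """Split rows into good and quarantined; annotate warnings/errors."""
    good, quarantined = [], []
    for row in rows:
        errors = [r["name"] for r in RULES
                  if r["criticality"] == "error" and r["check"](row)]
        warnings = [r["name"] for r in RULES
                    if r["criticality"] == "warn" and r["check"](row)]
        annotated = {**row, "_warnings": warnings}
        if errors:
            quarantined.append({**annotated, "_errors": errors})
        else:
            good.append(annotated)
    return good, quarantined

rows = [
    {"sensor_id": "a1", "temperature_c": 180.0},  # passes all checks
    {"sensor_id": None, "temperature_c": 25.0},   # error: null id
    {"sensor_id": "b2", "temperature_c": 999.0},  # warning only, kept
]
good, bad = apply_checks(rows)
print(len(good), len(bad))  # → 2 1
```

In Spark the same idea is applied as DataFrame transformations rather than a Python loop, but the separation of rules, criticality, and the good/quarantine split is the part that matters organisationally.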

Finally, we move beyond the code to the organisational reality. We will discuss how we embedded Data Quality Monitor into the organisation and share our views on the hard questions: Who is responsible for maintaining rules? Who monitors the results?

Talk outline

  • Motivation for the project

  • Initial situation and objectives

  • Framework evaluation

      ◦ Evaluation criteria for a successful data quality monitoring solution

      ◦ Comparison of available frameworks

  • Our implementation with DQX

      ◦ How to use built-in data quality checks

      ◦ How to add custom data quality checks

      ◦ Automated rule generation with the DQX Profiler

      ◦ Output and visualisation options

      ◦ Python project structure

  • Embedding in the organisation

      ◦ Rule maintenance

      ◦ How to communicate data quality issues

  • Summary

Key takeaways

  • Understanding of the most important criteria for choosing a data quality monitoring framework, from the perspective of a data engineer and an architect

  • Understanding of the DQX framework

  • Ideas for integrating data quality monitoring into an organisation


Expected audience expertise in your talk's domain: Intermediate

Expected audience expertise in Python: Intermediate

Public link to supporting material, e.g. videos, GitHub:

https://github.com/databrickslabs/dqx

Rostislaw, a data architect at RATIONAL AG, specializes in distributed databases, the Apache Hadoop ecosystem and Azure cloud. He leverages his expertise to maintain the enterprise Data & Analytics platform for IoT data, where his daily work involves reconciling diverse stakeholder perspectives to deliver sustainable solutions.

Joshua is a Data Engineer at inovex GmbH dedicated to building robust, scalable data products. Utilizing his foundation as a Full Stack Software Engineer, he applies rigorous software engineering principles to ensure every data solution is high-quality, maintainable, and efficient.