PyCon Lithuania 2024

Write-Audit-Publish Pattern in Modern Data Pipelines
2024-04-05 , Room 203

Data is the new oil, and like oil, it can leak and poison its surroundings. What happens if you pollute one of the datasets behind decision-maker-facing dashboards? In this talk, I will explain the reemergence of the Write-Audit-Publish pattern and how you can implement it using Apache Iceberg and Apache Spark.


One of the old patterns being adopted again is Write-Audit-Publish (WAP). Most data practitioners test the data they create only after it has been written, by which point bad data may already have damaged other data assets downstream. WAP prevents this by auditing new data before it is published to consumers. In this talk, I will present how evolving tech (in this case, the Apache Iceberg table format) lets us apply this pattern with Apache Spark. The presentation will cover topics like:

  • What is Write-Audit-Publish
  • What is Apache Spark and how it works with Apache Iceberg
  • What is the Branching feature in Apache Iceberg, and how does it connect to WAP
  • Example of its use together with Apache Spark
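
As a rough illustration of the pattern itself, here is a minimal in-memory sketch in plain Python. It is not Iceberg's or Spark's API: `ToyTable`, `audit`, and `write_audit_publish` are hypothetical names, and the audit rules (no missing `id`, no negative `amount`) are assumed purely for the example. New rows are written to a separate branch, checked there, and only copied to `main` if the checks pass.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """In-memory stand-in for a table with named branches (hypothetical, not Iceberg's API)."""
    branches: dict = field(default_factory=lambda: {"main": []})

    def create_branch(self, name, source="main"):
        # A new branch starts as a copy of the source branch's rows.
        self.branches[name] = list(self.branches[source])

    def write(self, branch, rows):
        # Write: new rows land only on the given branch, never directly on main.
        self.branches[branch] = self.branches[branch] + list(rows)

    def publish(self, branch, target="main"):
        # Publish: make the target reflect the audited branch's state.
        self.branches[target] = list(self.branches[branch])

    def drop_branch(self, name):
        del self.branches[name]


def audit(rows):
    """Assumed example checks: every row has an id and a non-negative amount."""
    return all(r.get("id") is not None and r.get("amount", 0) >= 0 for r in rows)


def write_audit_publish(table, rows, branch="audit"):
    table.create_branch(branch)          # isolate the incoming data
    table.write(branch, rows)            # Write
    ok = audit(table.branches[branch])   # Audit
    if ok:
        table.publish(branch)            # Publish only when the audit passes
    table.drop_branch(branch)            # clean up the staging branch either way
    return ok


table = ToyTable()
good = write_audit_publish(table, [{"id": 1, "amount": 10.0}])
bad = write_audit_publish(table, [{"id": None, "amount": 5.0}])
```

After these two calls, the first batch is visible on `main` and the second is not, because its audit failed before anything was published. Readers of `main` never see the rejected rows.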

I'm a Data Engineer with a diverse background, transitioning from a Data Analyst to a Team Lead and Head of Data before returning to my roots. I have a knack for numbers and a passion for coding, constantly seeking optimal solutions and driving continuous improvement.

With expertise in data pipelines, orchestration, SQL, and strong communication skills, I excel in leading and mentoring teams. I've been fortunate to contribute to multiple data migrations and projects, including building some from scratch.

Outside of work, I thrive in fast-paced environments, embracing new challenges and staying updated with the latest technologies through side projects. I share my knowledge with the community through my podcast and blog, 'Uncle Data,' where I discuss all things data-related.
