Cross-Platform Data Lineage with OpenLineage
2022-06-13 , Kesselhaus

There are more data tools available than ever before, and it's easier to build a pipeline than it's ever been. This has resulted in an explosion of innovation, but it also means that data within today's organizations has become increasingly distributed. It can't be contained within a single brain, a single team, or a single platform.

Data lineage can help by tracing the relationships between datasets and providing a map of your entire data universe. OpenLineage provides a standard for lineage collection that spans multiple platforms, including Apache Airflow, Apache Spark, Flink, and dbt. This empowers teams to diagnose and address widespread data quality and efficiency issues in real time.

In this session, Julien Le Dem from Datakin will show how to trace data lineage across Apache Spark and Apache Airflow. He will walk through the OpenLineage architecture and provide a live demo of a running pipeline with real-time data lineage.


Get your ticket now!

Register for Berlin Buzzwords in our ticket shop! We also have online tickets and reduced tickets for students available and you can find more information about our Diversity Ticket Initiative here!

Julien Le Dem is the Chief Architect of Astronomer and Co-Founder of Datakin. He co-created Apache Parquet and is involved in several open source projects including OpenLineage, Marquez (LFAI&Data), Apache Arrow, and Apache Iceberg. Previously, he was a senior principal at Wework; principal architect at Dremio; tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_); and a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.