06-13, 11:50–12:30 (Europe/Berlin), Kesselhaus
There are more data tools available than ever before, and it's easier to build a pipeline than it's ever been. This has resulted in an explosion of innovation, but it also means that data within today's organizations has become increasingly distributed. It can't be contained within a single brain, a single team, or a single platform.
Data lineage can help by tracing the relationships between datasets and providing a map of your entire data universe. OpenLineage provides a standard for lineage collection that spans multiple platforms, including Apache Airflow, Apache Spark, Flink, and dbt. This empowers teams to diagnose and address widespread data quality and efficiency issues in real time.
In this session, Julien Le Dem from Datakin will show how to trace data lineage across Apache Spark and Apache Airflow. He will walk through the OpenLineage architecture and provide a live demo of a running pipeline with real-time data lineage.
Julien Le Dem is the CTO and Co-Founder of Datakin. He co-created Apache Parquet and is involved in several open source projects including OpenLineage, Marquez (LFAI&Data), Apache Arrow, Apache Iceberg and a few others. Previously, he was a senior principal at Wework; principal architect at Dremio; tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_); and a tech lead on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.