2020-07-13 –, Tokyo Meetup, [Sessions start: Tuesday 14.07 1 pm (Monday 13.07 9 pm PDT)]
How do you create fast and painless delivery of new DAGs into production? When running Airflow at scale, managing the full lifecycle of your pipelines becomes a big challenge: DAGs need to be easy to develop, test, and ship into prod. In this talk, we will cover our suggested approach to building a proper CI/CD cycle that ensures the quality and fast delivery of production pipelines.
CI/CD is the practice of delivering software from dev to prod, optimized for fast iteration and quality control. In the data engineering context, DAGs are just another piece of software that requires some form of lifecycle management. Traditionally, DAGs have been thought of as relatively static, but the new wave of analytics and machine learning efforts requires more agile DAG development, in line with how agile software engineering teams build and ship code.
In this session, we will dive into the challenges of building CI/CD cycles for Airflow DAGs. We will focus on a pipeline that involves Apache Spark as an extra dimension of real-world complexity, walking through a typical flow of DAG authoring, debugging, and testing, from local to staging to prod environments. We will offer best practices and discuss open-source tools you can use to easily build your own smooth cycle for Airflow CI/CD.
Victor Shafran is cofounder of Databand, an APM and observability solution for data engineering teams. Victor brings 20 years of experience in enterprise software and data product development. In his last position as VP R&D at Equalum, a high-growth startup in the data pipelining space, he led a team developing big data infrastructure for Fortune 100 companies. Before that, Victor was Director of Research at SAP and NICE Systems, where he led a team of data scientists on machine learning research. Victor holds an MBA from Tel Aviv University and an M.Sc. (cum laude) in computer science.