Building data pipelines in Python: Airflow vs scripts soup
2019-09-02, 14:00–15:30, Track4 (Chillida)

In this workshop, you will learn how to migrate from a ‘scripts soup’ (a set of scripts that must be run in a particular order) to robust, reproducible and easy-to-schedule data pipelines in Airflow.


Introduction (5 minutes)

Format: presentation
Go over the agenda
List the relevant resources
Make sure everyone has followed the installation instructions

Intro to data pipelines

Format: presentation
Go over the components of traditional data science pipelines
Presentation of the scripts soup antipattern

Creating a script soup

Format: hands-on
The attendees will perform an ETL task on some data using a set of independent scripts.
In this exercise, I will provide the code and explain what we are trying to achieve with this pseudo-pipeline. The attendees will have a chance to try and reproduce it themselves.
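
To give a flavour of the exercise, here is a minimal sketch of such a ‘soup’, assuming three hypothetical files (extract.py, transform.py, load.py) and a toy dataset; the actual scripts and data used in the workshop may differ.

    # extract.py -- step 1, must be run first (by hand or from cron)
    import csv

    rows = [{"id": 1, "value": "a"}, {"id": 2, "value": ""}]   # stand-in for a real download
    with open("raw.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "value"])
        writer.writeheader()
        writer.writerows(rows)

    # transform.py -- step 2, silently assumes extract.py has already run
    import csv, json

    with open("raw.csv") as f:
        clean = [r for r in csv.DictReader(f) if r["value"]]   # drop incomplete rows
    with open("clean.json", "w") as f:
        json.dump(clean, f)

    # load.py -- step 3, silently assumes transform.py has already run
    import json, sqlite3

    with open("clean.json") as f:
        records = json.load(f)
    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS data (id INTEGER, value TEXT)")
    conn.executemany("INSERT INTO data VALUES (?, ?)",
                     [(r["id"], r["value"]) for r in records])
    conn.commit()

Nothing here enforces the ordering or records whether a step succeeded, which is exactly the pain point the rest of the workshop addresses.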

Introduction to Airflow and DAGs

Format: presentation
Introduce the concept of DAGs (directed acyclic graphs)
Present the main components of Airflow
Point attendees to the Airflow documentation
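
As a first taste of the syntax, a minimal DAG definition might look like the sketch below (assuming Airflow 1.10-style import paths; the DAG and task names are placeholders):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator   # Airflow 1.x import path

    # A DAG is the set of tasks (nodes) plus the dependencies (edges) between them.
    dag = DAG(
        dag_id="hello_airflow",
        start_date=datetime(2019, 9, 1),
        schedule_interval="@daily",
    )

    say_hello = BashOperator(task_id="say_hello", bash_command="echo hello", dag=dag)
    say_done = BashOperator(task_id="say_done", bash_command="echo done", dag=dag)

    # ">>" declares an edge: say_hello must finish before say_done starts.
    say_hello >> say_done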

Set up a local instance of Airflow

Format: hands-on
The attendees will create a local instance of Airflow and explore the sample DAGs provided.
They will be introduced to the scheduling capabilities of the tool and track the status of the pipelines using the web GUI.
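
To illustrate the scheduling capabilities explored in this part, the sketch below shows the scheduling-related arguments of a DAG (again assuming Airflow 1.10; the cron expression and retry settings are arbitrary examples):

    from datetime import datetime, timedelta

    from airflow import DAG

    dag = DAG(
        dag_id="scheduled_example",
        # The scheduler creates one DAG run per interval after start_date...
        start_date=datetime(2019, 9, 1),
        # ...using a cron expression or a preset such as "@daily" or "@hourly".
        schedule_interval="0 6 * * *",        # every day at 06:00
        # Do not backfill runs for intervals that have already passed.
        catchup=False,
        # Per-task defaults, e.g. retry failed tasks twice, 5 minutes apart.
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    )

Each resulting DAG run and the status of its tasks can then be tracked from the web GUI.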

ETL task on Airflow

Format: hands-on
I will provide hints on how to transform the scripts soup into Airflow DAGs.
For this, I will use pseudo-code and pedagogical approaches inspired by the Software Carpentry lessons to guide the attendees towards deploying their first DAG in Airflow.
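
As a rough sketch of where this exercise is heading (assuming the hands-on code is collected in a hypothetical etl_steps module exposing extract(), transform() and load(), and Airflow 1.10-style imports), the three scripts become three ordered tasks in one DAG:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator   # Airflow 1.x import path

    from etl_steps import extract, transform, load   # hypothetical module holding the former scripts

    dag = DAG(
        dag_id="etl_pipeline",
        start_date=datetime(2019, 9, 1),
        schedule_interval="@daily",
    )

    t_extract = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
    t_transform = PythonOperator(task_id="transform", python_callable=transform, dag=dag)
    t_load = PythonOperator(task_id="load", python_callable=load, dag=dag)

    # The implicit ordering of the scripts soup becomes an explicit dependency chain.
    t_extract >> t_transform >> t_load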

Wrap up and questions

Format: Q&A

Setup

https://opendata-airflow-tutorial.readthedocs.io/en/latest/setup.html


Domains – Jupyter, Scientific data flow and persistence
Project Homepage / Git
Domain Expertise – some Python
Skill Level – professional
Abstract as a tweet – learn how to automate and level up data pipelines with Airflow