Building data pipelines in Python: Airflow vs scripts soup
2019-09-02, Track4 (Chillida)

In this workshop, you will learn how to migrate from a ‘scripts soup’ (a set of scripts that must be run in a particular order) to robust, reproducible and easy-to-schedule data pipelines in Airflow.


Introduction (5 minutes)

Format: presentation
Go over the agenda
List the relevant resources
Make sure everyone has followed the installation instructions

Intro to data pipelines

Format: presentation
Go over the components of traditional data science pipelines
Present the scripts soup antipattern

Creating a scripts soup

Format: hands-on
The attendees will perform an ETL task on some data using a set of independent scripts.
In this exercise, I will provide the code and explain what we are trying to achieve with this pseudo-pipeline. The attendees will then have a chance to reproduce it themselves.
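To give a flavour of what a scripts soup looks like, here is a minimal sketch. It is not the workshop's actual code: the three functions stand in for hypothetical separate scripts (`extract.py`, `transform.py`, `load.py`), and the file names and sample data are invented for illustration. The only thing tying the steps together is the order in which someone remembers to run them.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical stand-ins for three independent scripts. Each one only
# knows about the files it reads and writes, not about the other steps.


def extract(workdir: Path) -> None:
    """Pretend to fetch raw data and save it as a CSV file."""
    rows = [("city", "temp_c"), ("Bilbao", "21.5"), ("Madrid", "30.0")]
    with open(workdir / "raw.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)


def transform(workdir: Path) -> None:
    """Read the raw CSV and write a cleaned JSON file (Celsius to Fahrenheit)."""
    with open(workdir / "raw.csv", newline="") as f:
        records = [
            {"city": row["city"], "temp_f": float(row["temp_c"]) * 9 / 5 + 32}
            for row in csv.DictReader(f)
        ]
    (workdir / "clean.json").write_text(json.dumps(records))


def load(workdir: Path) -> list:
    """Read the cleaned JSON; a real script might write to a database instead."""
    return json.loads((workdir / "clean.json").read_text())


def run_in_order(workdir: Path) -> list:
    """The 'soup': correctness depends entirely on calling these in this order."""
    extract(workdir)
    transform(workdir)
    return load(workdir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_in_order(Path(tmp)))
```

Running `transform` before `extract` fails with a `FileNotFoundError`, which is exactly the kind of implicit ordering that a pipeline tool makes explicit.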

Introduction to Airflow and DAGs

Format: presentation
Introduce the concept of DAGs (directed acyclic graphs)
Present and introduce the components of Airflow
Airflow documentation
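As a rough illustration of the DAG concept, independent of Airflow itself, the sketch below describes a pipeline as a mapping from each task to the tasks it depends on, and uses the standard library's `graphlib` (Python 3.9+) to derive a valid execution order. The task names are assumptions chosen for the example.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def execution_order(graph):
    """Return one valid order in which the tasks can be run."""
    return list(TopologicalSorter(graph).static_order())


def is_acyclic(graph):
    """A graph with a cycle has no valid order, so it is not a DAG."""
    try:
        execution_order(graph)
    except CycleError:
        return False
    return True


if __name__ == "__main__":
    print(execution_order(pipeline))
    print(is_acyclic({"a": {"b"}, "b": {"a"}}))
```

In this example `load` and `report` both depend only on `transform`, so either may come first. The "acyclic" requirement is what guarantees that at least one valid order exists.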

Set up a local instance of Airflow

Format: hands-on
The attendees will create a local instance of Airflow and explore the sample DAGs provided.
They will be introduced to the scheduling capabilities of the tool and track the status of the pipelines using the web GUI.

ETL task on Airflow

Format: hands-on
I will provide hints on how to transform the scripts soup into Airflow DAGs.
For this, I will use pseudocode and other pedagogical approaches inspired by the Software Carpentry lessons to guide the attendees towards deploying their first DAG in Airflow.
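In the spirit of that pseudocode, the sketch below shows the shape of the transformation in plain Python: each script becomes a named task, and the run order becomes explicit dependencies. The `Task` class and `run` function are hypothetical stand-ins written for this illustration, not Airflow's own classes; they only mimic the `>>` notation that Airflow offers for declaring that one task runs before another. The sketch does not detect cycles.

```python
class Task:
    """Hypothetical stand-in for a pipeline task; not Airflow's own class."""

    def __init__(self, task_id, func):
        self.task_id = task_id
        self.func = func
        self.upstream = []  # tasks that must run before this one

    def __rshift__(self, other):
        # `a >> b` records that b depends on a, and returns b so chains work.
        other.upstream.append(self)
        return other


def run(tasks):
    """Run every task exactly once, after all of its upstream tasks."""
    done = []

    def visit(task):
        if task.task_id in done:
            return
        for parent in task.upstream:
            visit(parent)
        task.func()
        done.append(task.task_id)

    for task in tasks:
        visit(task)
    return done


# Each former script becomes a task; here the work is faked with log entries.
log = []
extract = Task("extract", lambda: log.append("extracted"))
transform = Task("transform", lambda: log.append("transformed"))
load = Task("load", lambda: log.append("loaded"))

# The order is now declared once, instead of living in someone's memory.
extract >> transform >> load

# The tasks are deliberately passed in the "wrong" order.
order = run([load, extract, transform])
print(order)
```

Because the dependencies are declared on the tasks themselves, the order in which they are handed to `run` no longer matters.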

Wrap up and questions

Format: Q&A

Setup

https://opendata-airflow-tutorial.readthedocs.io/en/latest/setup.html


Abstract as a tweet

Learn how to automate and level up data pipelines with Airflow

Python Skill Level

professional

Domain Expertise

some

Domains

Jupyter, Scientific data flow and persistence

Tania is a Research Engineer and developer advocate with vast experience in academic research and industrial environments. Her main areas of expertise are data-intensive applications, scientific computing, and machine learning. Another focus of her work is the improvement of processes, reproducibility and transparency in research, data science and artificial intelligence.
Over the last few years, she has trained hundreds of people on scientific computing, reproducible workflows, and the testing, monitoring and scaling of ML models, and has delivered talks on these topics worldwide.

She is passionate about mentoring, open source, and its community, and is involved in a number of initiatives aimed at building more diverse and inclusive communities. She is also a contributor, maintainer, and developer of a number of open source projects and the founder of PyLadies NorthWest UK.

Tania has extensive experience delivering workshops and talks all over the world, from big conferences such as PyCon to smaller user groups and interest groups. She is interested in both technical talks and talks covering community aspects.