Building data pipelines in Python: Airflow vs scripts soup
2019-09-02, 14:00–15:30, Track4 (Chillida)

In this workshop, you will learn how to migrate from a ‘scripts soup’ (a set of scripts that must be run in a particular order) to robust, reproducible and easy-to-schedule data pipelines in Airflow.

Introduction (5 minutes)

Format: presentation
Go over the agenda
List the relevant resources
Make sure everyone has followed the installation instructions

Intro to data pipelines

Format: presentation
Go over the components of traditional data science pipelines
Presentation of the scripts soup antipattern

Creating a script soup

Format: hands-on
The attendees will perform an ETL task on some data using a set of independent scripts.
In this exercise, I will provide the code and explain what we are trying to achieve with this pseudo-pipeline. The attendees will then have a chance to reproduce it themselves.
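The exercise could be sketched roughly as below: three steps that only work when run in the right order, with files on disk as their only interface. The step names and file names here are illustrative, not the actual workshop material; in the exercise each step would live in its own script (e.g. an extract, a transform and a load script) run by hand.

```python
import csv
import json
from pathlib import Path


def extract(raw_path):
    # Step 1: dump some "raw" records to a CSV file.
    rows = [("alice", 31), ("bob", 27)]
    with open(raw_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "age"])
        writer.writerows(rows)


def transform(raw_path, clean_path):
    # Step 2: read the CSV and normalise names.
    # Fails if step 1 was skipped -- the "soup" problem.
    with open(raw_path, newline="") as f:
        records = [dict(r) for r in csv.DictReader(f)]
    for r in records:
        r["name"] = r["name"].title()
    Path(clean_path).write_text(json.dumps(records))


def load(clean_path):
    # Step 3: "load" the cleaned records (here, just parse them back).
    return json.loads(Path(clean_path).read_text())


# The implicit ordering that makes this a soup: nothing but the
# author's memory enforces extract -> transform -> load.
extract("raw.csv")
transform("raw.csv", "clean.json")
records = load("clean.json")
print(records[0]["name"])  # Alice
```

The fragility is the point: the run order lives only in the operator's head (or a README), which is exactly what Airflow makes explicit later in the session.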

Introduction to Airflow and DAGs

Format: presentation
Introduce the concept of DAGs (directed acyclic graphs)
Present and introduce the components of Airflow
Airflow documentation
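The DAG idea can be shown with the standard library alone, before any Airflow is involved: tasks declare what they depend on, and a valid run order is derived from those dependencies instead of being hard-coded. This is a conceptual sketch using Python's `graphlib` (not Airflow's API).

```python
# Tasks declare dependencies; a topological order falls out of the graph.
from graphlib import TopologicalSorter

# Each key depends on the tasks in its set (illustrative task names).
deps = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['extract', 'transform', 'load', 'report']
```

Because the graph must be acyclic, a cycle (e.g. making "extract" depend on "report") would raise a `CycleError` instead of silently producing a broken run order.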

Set up a local instance of Airflow

Format: hands-on
The attendees will create a local instance of Airflow and explore the sample DAGs provided.
They will be introduced to the scheduling capabilities of the tool and track the status of the pipelines using the web GUI.
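The local setup could look roughly like the following, assuming Airflow 1.10 (current at the time of the workshop) installed into a virtual environment; paths and the port are illustrative.

```shell
export AIRFLOW_HOME=~/airflow      # where Airflow keeps its config and DB
pip install apache-airflow

airflow initdb                     # create the metadata database
airflow webserver -p 8080          # web GUI at http://localhost:8080
airflow scheduler                  # run in a second terminal
```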

ETL task on Airflow

Format: hands-on
I will provide hints on how to transform the scripts soup into Airflow DAGs.
For this, I will use pseudocode and other pedagogical approaches inspired by the Software Carpentry lessons to guide the attendees through deploying their first DAG in Airflow.
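A minimal sketch of what the converted pipeline might look like as an Airflow DAG file (a declarative description of the pipeline, picked up by the scheduler rather than executed directly). It assumes a working Airflow 1.10 install; the DAG id, schedule and script names mirror the scripts-soup exercise and are illustrative only.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="etl_from_scripts_soup",
    start_date=datetime(2019, 9, 1),
    schedule_interval="@daily",  # Airflow now owns the scheduling
)

# Each former script becomes a task...
extract = BashOperator(task_id="extract", bash_command="python extract.py", dag=dag)
transform = BashOperator(task_id="transform", bash_command="python transform.py", dag=dag)
load = BashOperator(task_id="load", bash_command="python load.py", dag=dag)

# ...and the run order becomes an explicit dependency graph.
extract >> transform >> load
```

Dropped into `$AIRFLOW_HOME/dags/`, this file gives the soup a schedule, retries and a visible status in the web GUI with no change to the underlying scripts.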

Wrap up and questions

Format: Q&A


Abstract as a tweet – learn how to automate and level up data pipelines with Airflow
Domain Expertise – some
Domains – Jupyter, Scientific data flow and persistence
Project Homepage / Git –
Python Skill Level – professional