2019-09-02, Track 4 (Chillida)
In this workshop, you will learn how to migrate from ‘script soups’ (sets of scripts that must be run in a particular order) to robust, reproducible and easy-to-schedule data pipelines in Airflow.
Introduction (5 minutes)
Format: presentation
Go over the agenda
List the relevant resources
Make sure everyone has followed the installation instructions
Intro to data pipelines
Format: presentation
Go over the components of traditional data science pipelines
Present the ‘script soup’ antipattern
Creating a script soup
Format: hands-on
The attendees will perform an ETL task on some data using a set of independent scripts.
In this exercise, I will provide the code and explain what we are trying to achieve with this pseudo-pipeline. The attendees will then have a chance to reproduce it themselves.
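For illustration, a script soup of this kind might look like the sketch below. The file names (extract.py, transform.py, load.py) are hypothetical stand-ins for the workshop material; the point is that the required ordering lives only in a runner script (or in someone's head).

    # run_all.py -- hypothetical "script soup" runner: the pipeline works only
    # if the independent scripts are executed in exactly this order.
    import subprocess
    import sys

    STEPS = ["extract.py", "transform.py", "load.py"]  # hypothetical file names

    for step in STEPS:
        print(f"Running {step} ...")
        result = subprocess.run([sys.executable, step])
        if result.returncode != 0:
            sys.exit(f"{step} failed; downstream scripts were not run.")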
Introduction to Airflow and DAGs
Format: presentation
Introduce the concept of DAGs (directed acyclic graphs)
Present the components of Airflow
Airflow documentation
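To make the DAG concept concrete, a minimal Airflow DAG might look like the sketch below, assuming the Airflow 1.x API current at the time of the workshop; the dag_id, schedule and callables are illustrative, not the workshop's actual code. Each task is a node, and the >> operator declares the edges between them.

    # minimal_dag.py -- a sketch of a DAG with two dependent tasks.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def say_hello():
        print("hello")

    def say_goodbye():
        print("goodbye")

    dag = DAG(
        dag_id="minimal_example",         # illustrative name
        start_date=datetime(2019, 9, 1),
        schedule_interval="@daily",       # run once a day
    )

    hello = PythonOperator(task_id="say_hello", python_callable=say_hello, dag=dag)
    goodbye = PythonOperator(task_id="say_goodbye", python_callable=say_goodbye, dag=dag)

    hello >> goodbye  # say_goodbye runs only after say_hello succeeds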
Set up a local instance of Airflow
Format: hands-on
The attendees will create a local instance of Airflow and explore the sample DAGs provided.
They will be introduced to the tool's scheduling capabilities and will track the status of their pipelines using the web GUI.
ETL task on Airflow
Format: hands-on
I will provide hints on how to transform the script soup into Airflow DAGs.
For this, I will use pseudocode and other pedagogical approaches inspired by the Software Carpentry lessons to guide the attendees through the deployment of their first DAG in Airflow.
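As a hedged sketch of the end goal (not the exact solution code), the hypothetical three-script soup from earlier could become a single DAG along these lines; the function bodies are placeholders for the workshop's own ETL logic:

    # etl_dag.py -- the script-soup steps become tasks in one Airflow DAG.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def extract():
        ...  # was extract.py: pull the raw data (placeholder)

    def transform():
        ...  # was transform.py: clean and reshape the data (placeholder)

    def load():
        ...  # was load.py: persist the result (placeholder)

    dag = DAG(
        dag_id="etl_pipeline",  # illustrative name
        start_date=datetime(2019, 9, 1),
        schedule_interval="@daily",
    )

    t_extract = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
    t_transform = PythonOperator(task_id="transform", python_callable=transform, dag=dag)
    t_load = PythonOperator(task_id="load", python_callable=load, dag=dag)

    # Airflow now enforces the ordering the script soup relied on by convention.
    t_extract >> t_transform >> t_load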
Wrap up and questions
Format: Q&A
Setup
https://opendata-airflow-tutorial.readthedocs.io/en/latest/setup.html
Learn how to automate and level up data pipelines with Airflow
Python Skill Level: professional
Domain Expertise: some
Domains: Jupyter, Scientific data flow and persistence
Tania is a Research Engineer and developer advocate with vast experience in both academic research and industrial environments. Her main areas of expertise are data-intensive applications, scientific computing, and machine learning, with a particular focus on improving processes, reproducibility and transparency in research, data science and artificial intelligence.
Over the last few years, she has trained hundreds of people in scientific computing, reproducible workflows, and the testing, monitoring and scaling of ML models, and has delivered talks on these topics worldwide.
She is passionate about mentoring, open source and its community, and is involved in a number of initiatives aimed at building more diverse and inclusive communities. She is also a contributor, maintainer, and developer of a number of open source projects and the Founder of PyLadies NorthWest UK.
Tania has vast experience delivering both workshops and talks all over the world, from big conferences such as PyCon to smaller user groups and interest groups. She is interested both in technical talks and in talks covering community aspects.