Now that you finally have your machine learning model trained, what's the next step for moving it to production?
Orchestrating, scheduling and monitoring ML inference pipelines is a big challenge.
Airflow can be your ally in handling this complexity.
After working hard to develop a machine learning model, you know that there is still one step left: moving it to production.
In a common scenario, what you probably want is a workflow that automates:
* gathering and preprocessing the data
* running inference on them
* storing the predictions
Ideally, you also want a tool that helps you:
* deal with big data
* guarantee robustness and resilience
* execute your workflows on a schedule or when certain preconditions are met
* resolve dependencies between tasks
If until now you have been using cron to schedule jobs, this could be the right time to adopt a well-established tool like Apache Airflow to address this complexity.
Apache Airflow is an open source project written in Python for programmatically authoring, scheduling and monitoring batch execution of tasks.
You can design your pipelines according to the logic you define: decide which actions to perform, retry them if errors occur, skip tasks if dependencies are not met, monitor execution status and inspect logs through a friendly and powerful web UI, and a lot more.
A very nice feature of Airflow is that all of the above is configured and defined in Python code.
Airflow pipelines can therefore benefit from standard software development practices such as peer review, automated testing and version control.
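To give an idea of what this looks like, here is a minimal sketch of a DAG that chains the three steps mentioned above (preprocess, inference, store). It is only an illustration, not the pipeline built in the workshop; the task names and callables are hypothetical placeholders, and the imports follow the classic Airflow 1.x style shipped with the puckel/docker-airflow image:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def preprocess_data():
    # Placeholder: gather and preprocess the input data
    pass


def run_inference():
    # Placeholder: load the trained model and compute predictions
    pass


def store_predictions():
    # Placeholder: persist the predictions (database, files, ...)
    pass


default_args = {
    "owner": "airflow",
    "retries": 1,                          # retry a failed task once
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="ml_inference_pipeline",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",            # run once per day
)

preprocess = PythonOperator(task_id="preprocess", python_callable=preprocess_data, dag=dag)
inference = PythonOperator(task_id="inference", python_callable=run_inference, dag=dag)
store = PythonOperator(task_id="store", python_callable=store_predictions, dag=dag)

# Task dependencies: preprocess -> inference -> store
preprocess >> inference >> store

Because the DAG is plain Python, it can be reviewed, tested and versioned like any other piece of code.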
In this workshop we'll go over basic Airflow concepts and set up an instance to orchestrate an inference pipeline for a machine learning model.
Details for Audience
- It assumes no previous Airflow knowledge.
- The main goal is to create a basic training and inference pipeline with Airflow.
- It is not about a particular model / ML method.
- It's not an advanced Airflow workshop.
- It is not suitable for Python beginners.
Workshop Requirements
- Docker installed.
- Any editor (Sublime, PyCharm, Vim, Atom).
- Verify that Docker works properly.
- Ensure that you have allocated 4 GB of RAM to the Docker Engine (this can be done in the desktop app under Preferences; restart Docker after changing the setting).
- Download the Airflow Docker image (see the example run command after this list):
docker pull puckel/docker-airflow
- Download the repository under the $HOME directory:
git clone https://github.com/deliveryhero/pyconde2019-airflow-ml-workshop
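To check that the image works, one option (a sketch; the workshop repository may provide its own run instructions) is to start the Airflow webserver from the pulled image and open the UI at http://localhost:8080:

docker run -d -p 8080:8080 puckel/docker-airflow webserver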
Python Expertise: expert
Abstract as a tweet: Automate your machine learning and data pipelines with Apache Airflow
Domain Expertise: some
Domains: Big Data, Infrastructure, Machine Learning, Data Engineering
Public link to supporting material: https://github.com/deliveryhero/pyconde2019-airflow-ml-workshop