How do we reason about the reliability of our data pipeline in Wrike?
07-14, 09:00–09:45 (US/Pacific), NYC Meetup [Sessions start: Tuesday 14.07 12pm (Tuesday 14.07 9am PDT)]

We have been using Airflow for almost two years and have scaled it from two users to eight teams.

We would like to share how we reason about the reliability of our data pipelines.

We will cover:
* How we establish a reliable review process on Airflow.
* How we use a multiple-Airflow configuration to migrate from our data center to the cloud and to reuse production data in acceptance wherever possible.
* How we use data versioning to make sure that data is up to date throughout the pipeline.


Establishing a reliable review process on Airflow
* We have a single acceptance environment where results of data pipelines should be reviewed.
* A data review can take up to several weeks.
* We'd like to release several pipelines a week.
* But to make a data review reliable, we need a feature freeze for the pipeline under review.
* We use mini-repositories to achieve this.

Multiple-Airflow configuration
* To start the migration to the cloud.
* To implement an acceptance environment that reuses data from production and stores and computes only the pipelines that are going through the review process.
* To provide a fast and isolated environment for the analytics team.
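
As a rough illustration of the second point, an acceptance task could decide per dataset whether to read from production or from acceptance. This is only a sketch under assumptions: the names `Environment`, `resolve_source`, `prod_dwh`, `acceptance_dwh`, and the example pipelines are hypothetical and not taken from our setup.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Environment:
    name: str
    storage_prefix: str  # hypothetical location of this environment's tables


# Hypothetical environments; real names and locations will differ.
PROD = Environment("prod", "prod_dwh")
ACCEPTANCE = Environment("acceptance", "acceptance_dwh")


def resolve_source(dataset: str, producing_pipeline: str, under_review: Set[str]) -> str:
    """Return the table an acceptance task should read.

    Datasets produced by a pipeline that is currently under review are read
    from acceptance; everything else is reused from production.
    """
    env = ACCEPTANCE if producing_pipeline in under_review else PROD
    return f"{env.storage_prefix}.{dataset}"


if __name__ == "__main__":
    reviewed = {"billing_report"}  # hypothetical pipeline under review
    print(resolve_source("invoices", "billing_report", reviewed))
    print(resolve_source("users", "user_dimension", reviewed))
```

With a rule like this, acceptance only has to store and compute the outputs of pipelines that are under review.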

Data versioning
* Each data increment in acceptance gets its own version to distinguish it from the prod version.
* The version is used to make sure that the data pipeline processes the version of the data it was tested with.
* We have multiple versions of the same data in our pipeline: some of the analytical reports need to freeze source data before usage.
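
The version check described above can be sketched as follows. This is a minimal illustration under assumptions, not our actual implementation: `DataVersion`, `VersionMismatch`, and `check_input_version` are hypothetical names, and a version is assumed to be an environment plus an increment number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataVersion:
    environment: str  # assumed to be "prod" or "acceptance"
    increment: int    # assumed monotonically increasing increment number

    def label(self) -> str:
        # The environment prefix keeps acceptance versions distinct from prod ones.
        return f"{self.environment}-v{self.increment}"


class VersionMismatch(Exception):
    """Raised when a pipeline is about to read data it was not tested with."""


def check_input_version(expected: DataVersion, actual: DataVersion) -> None:
    """Fail fast if the input data is not the version the pipeline was tested with."""
    if expected != actual:
        raise VersionMismatch(
            f"expected {expected.label()}, got {actual.label()}"
        )


if __name__ == "__main__":
    tested_with = DataVersion("acceptance", 3)
    check_input_version(tested_with, DataVersion("acceptance", 3))  # passes silently
    try:
        check_input_version(tested_with, DataVersion("prod", 3))
    except VersionMismatch as err:
        print(f"refusing to run: {err}")
```

A report that needs frozen source data could pin its `expected` version in the same way and keep reading that version while newer ones are produced.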

Working in Data Engineering at Wrike since August 2016.

Migrated product data engineering ETLs between Spark clusters
Leading the migration from our own hardware to GCP and BigQuery
Making data available to engineers across the company