Scientific DevOps: Designing Reproducible Data Analysis Pipelines with Containerized Workflow Managers
2019-09-04, 14:45–15:15, Track 3 (Oteiza)

A review of DevOps tools as applied to data analysis pipelines, including workflow managers, software containers, testing frameworks, and online repositories for performing reproducible science that scales.

Open source and open science come together when the software is accessible, transparent, and owned by all. For data analysis pipelines that grow in complexity beyond a single Jupyter notebook, keeping the software that way becomes a challenge as the number of steps and software dependencies increases. In this talk, Nicholas Del Grosso will review a variety of tools for packaging and managing a data analysis pipeline, showing how they fit together and how they benefit the development, testing, deployment, and publication processes as well as the scientific community. In particular, this talk will cover:

  • Workflow managers (e.g. Snakemake, PyDoit, Luigi) to combine complex pipelines into single applications.

  • Container Solutions (e.g. Docker and Singularity) to package and deploy the software on others' computers, including high-performance computing clusters.

  • The Scientific Filesystem to build explorable and multi-purpose applications.

  • Testing Frameworks (e.g. pytest, Hypothesis) to declare and confirm the assumptions and functionality of the analysis pipeline.

  • Ease-of-Use Utilities to share the pipeline online and make it accessible to non-programmers.
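
To make the first and fourth points concrete, here is a minimal sketch of the idea behind a workflow manager, paired with a test-style assertion. It uses only the Python standard library (`graphlib`, Python 3.9+) and is not how Snakemake, PyDoit, or Luigi are implemented; the task names (`load`, `clean`, `summarize`), the `TASKS` table, and `run_pipeline` are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter


# Hypothetical three-step pipeline; each step is an ordinary function.
def load():
    return [3.0, 1.0, 2.0]


def clean(raw):
    return sorted(raw)


def summarize(cleaned):
    return sum(cleaned) / len(cleaned)


# Each task maps to (function, names of the tasks whose outputs it consumes).
TASKS = {
    "load": (load, []),
    "clean": (clean, ["load"]),
    "summarize": (summarize, ["clean"]),
}


def run_pipeline(tasks):
    """Run every task once, only after all of its dependencies have run."""
    graph = {name: set(deps) for name, (_, deps) in tasks.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = tasks[name]
        results[name] = func(*(results[dep] for dep in deps))
    return results


results = run_pipeline(TASKS)

# A test in the spirit of the testing-framework bullet: it declares an
# assumption about the pipeline (cleaning must not drop or add samples).
assert len(results["clean"]) == len(results["load"])
```

Real workflow managers add much more on top of this ordering step, such as tracking which output files are out of date and dispatching jobs to a cluster, but the dependency graph is the part that lets many steps behave like a single application.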

By writing software that stays manageable, reproducible, and deployable continuously throughout the development cycle, we can better fulfill the goals of open science and good scientific practice in a digital era.

Abstract as a tweet – DevOps in science: making data analysis shareable, reproducible, and open!
Python Skill Level – professional
Domain Expertise – some
Domains – Open Source, Scientific data flow and persistence