Scientific DevOps: Designing Reproducible Data Analysis Pipelines with Containerized Workflow Managers
09-04, 14:45–15:15 (UTC), Track 3 (Oteiza)

A review of DevOps tools as applied to data analysis pipelines, including workflow managers, software containers, testing frameworks, and online repositories for performing reproducible science that scales.


Open source and open science come together when the software is accessible, transparent, and owned by all. For data analysis pipelines that grow in complexity beyond a single Jupyter notebook, this becomes a challenge as the number of steps and software dependencies grows. In this talk, Nicholas Del Grosso will review a variety of tools for packaging and managing a data analysis pipeline, showing how they fit together and how they benefit development, testing, deployment, publication, and the wider scientific community. In particular, this talk will cover:

  • Workflow Managers (e.g. Snakemake, PyDoit, Luigi) to combine the steps of a complex pipeline into a single application (see the sketch after this list).

  • Container Solutions (e.g. Docker and Singularity) to package and deploy the software on others' computers, including high-performance computing clusters.

  • The Scientific Filesystem (SCIF) to build explorable and multi-purpose applications.

  • Testing Frameworks (e.g. PyTest, Hypothesis) to declare and confirm the assumptions and functionality of the analysis pipeline.

  • Ease-of-Use Utilities to share the pipeline online and make it accessible to non-programmers.
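
As a minimal illustration of the workflow-manager idea, a two-step pipeline written for PyDoit might look like the sketch below. The file names and helper scripts are hypothetical placeholders, not material from the talk; the point is that each task declares its inputs and outputs, so the workflow manager can rerun only the steps whose dependencies have changed.

    # dodo.py -- a hypothetical two-step analysis expressed as PyDoit tasks

    def task_clean_data():
        """Remove malformed rows from the raw measurements."""
        return {
            "file_dep": ["data/raw.csv"],
            "targets": ["data/clean.csv"],
            "actions": ["python scripts/clean.py data/raw.csv data/clean.csv"],
        }

    def task_summary_figure():
        """Rebuild the summary figure whenever the cleaned data changes."""
        return {
            "file_dep": ["data/clean.csv"],
            "targets": ["figures/summary.png"],
            "actions": ["python scripts/plot.py data/clean.csv figures/summary.png"],
        }

Running doit in the project directory then executes only the out-of-date tasks, which is part of what keeps a growing pipeline manageable and reproducible.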

By writing software that stays manageable, reproducible, and continuously deployable throughout the development cycle, we can better fulfill the goals of open science and good scientific practice in a digital era.


Abstract as a tweet

DevOps in science: making data analysis shareable, reproducible, and open!

Python Skill Level

professional

Domain Expertise

some

Domains

Open Source, Scientific data flow and persistence

Nicholas Del Grosso is an American neuroscientist and post-doc in Germany who is passionate about open, reproducible science. Besides teaching data analysis and programming to scientists in courses, workshops, and at PyData Munich, he builds scientific software to study the learning process itself: from the brain's responses to machine-brain interfaces, to rats' understanding of 3D virtual environments, to scientists' responses to the stress of managing their own experiments!

Note: Nick is currently looking for a post-doctoral position to work on problems related to reproducible science! If you're looking for someone like him, send him a message or come say hello!