Reproducible Data Science in Python
2019-09-02, 11:00–12:30, Track4 (Chillida)

In this tutorial, we will take a detailed look at the concept of reproducibility, survey the landscape of existing solutions, and then do some hands-on work with one solution in particular, Renku.


Reproducibility has been an expectation of scientific work for several hundred years, and research communities and funding sources increasingly demand it. Within the Python ecosystem, there are now a variety of tools available to support reproducible data science, but choosing and using one is not always straightforward. One source of confusion is simply the number of available options. Beyond that, the term "reproducibility" can mean multiple things, making it difficult to compare tools.

In this tutorial, we will examine reproducibility from the perspective of the philosophy of science. That will give us the concepts and vocabulary necessary to precisely understand and discuss different definitions of the term and allow us to identify the technologies that provide the building blocks for reproducible data science. We will briefly survey the landscape of existing solutions and then spend the remaining time looking at one solution in particular, Renku, which we will use to work end-to-end through a reproducible data-science scenario.

  • 0:00 - 0:35 Introduction & Background

    • 0:00 - 0:15 Reproducibility, a philosophy of science perspective
      • Overview of reproducibility issues in different domains of science (Nature 2016 survey results)
      • Definition of different degrees of reproducibility: reproducibility, replicability, and repeatability
      • Examination of the function of reproducibility in the scientific process
    • 0:15 - 0:25 Building blocks for reproducibility: clean code, workflow automation, version control, containerization, provenance tracking
    • 0:25 - 0:35 Survey of the tool landscape: BinderHub, Pachyderm, Beaker, Gigantum, Whole Tale, SingularityHub, DVC, Stencila, dotscience, amie, Code Ocean, Renku
  • 0:35 - 1:30 Hands-on session with Renku, in which we will develop a typical data-science use case, focusing on the building blocks of reproducibility along the way.
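
To make the "provenance tracking" building block from the agenda more concrete, here is a minimal sketch using only the Python standard library. It is not how Renku or any of the surveyed tools works; the function names (`sha256_of`, `run_with_provenance`, `count_lines`) and the record layout are hypothetical and chosen purely for illustration. The idea is to record content hashes of the inputs and the output of an analysis step, so that a later rerun can be checked against the original.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_with_provenance(step, inputs, output):
    """Run `step(inputs, output)` and return a record linking inputs to output.

    `step` is any callable that reads the input files and writes the output
    file. The record layout is an assumption made for this sketch.
    """
    input_hashes = {str(p): sha256_of(p) for p in inputs}
    step(inputs, output)
    return {
        "step": step.__name__,
        "inputs": input_hashes,
        "output": {str(output): sha256_of(output)},
    }


def count_lines(inputs, output):
    """Toy analysis step: write the total number of lines across the inputs."""
    total = sum(len(Path(p).read_text().splitlines()) for p in inputs)
    Path(output).write_text(f"{total}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data.csv"
        data.write_text("a,b\n1,2\n3,4\n")
        result = Path(tmp) / "lines.txt"
        record = run_with_provenance(count_lines, [data], result)
        print(json.dumps(record, indent=2))
```

If the same step is rerun on unchanged inputs and produces a record with identical hashes, the result has been repeated in the narrow sense; a changed output hash signals that something in the code, data, or environment has drifted.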

Requirements and setup instructions

We will run the tutorial on https://renkulab.io, so please register and create an account by following these instructions.

To follow along with the slides, go here


Domains – Statistics, Scientific data flow and persistence, Simulation, Political and Social Sciences, Medicine/Health, Materials Science, Machine Learning, Jupyter, Earth, Ocean and Geo Science, Data Visualisation, Big Data, Astronomy
Project Homepage / Git – https://github.com/SwissDataScienceCenter/renku
Domain Expertise – none
Python Skill Level – basic
Abstract as a tweet – Come learn about doing Reproducible Data Science in Python at EuroSciPy 2019!