Reproducible Data Science in Python
2019-09-02, 11:00–12:30, Track4 (Chillida)

In this tutorial, we will take a detailed look at the concept of reproducibility, survey the landscape of existing solutions, and do some hands-on work with one solution in particular, Renku.

Reproducibility has been an expectation of scientific work for several hundred years, and communities and funding sources are increasingly demanding it. Within the Python ecosystem, there are now a variety of tools available to support reproducible data science, but choosing and using one is not always straightforward. One source of confusion is simply the number of available options. Beyond that, the term "reproducibility" can mean multiple things, which makes it difficult to compare tools.

In this tutorial, we will examine reproducibility from the perspective of the philosophy of science. That will give us the concepts and vocabulary necessary to precisely understand and discuss different definitions of the term and allow us to identify the technologies that provide the building blocks for reproducible data science. We will briefly survey the landscape of existing solutions and then spend the remaining time looking at one solution in particular, Renku, which we will use to work end-to-end through a reproducible data-science scenario.
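To make one of these building blocks concrete, here is a minimal sketch of provenance tracking using only the Python standard library. It is an illustration under our own assumptions, not a description of how Renku or any other surveyed tool works: the names `sha256_of`, `run_step`, and `demo`, the `scale` parameter, and the layout of the record are all hypothetical. The idea is simply to store, alongside each result, what went in, what came out, and which parameters and interpreter were used.

```python
import hashlib
import json
import platform
import sys
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_step(input_path: Path, output_path: Path, scale: float) -> dict:
    """Hypothetical analysis step: scale each number, then describe the run."""
    values = [float(token) for token in input_path.read_text().split()]
    output_path.write_text("\n".join(str(v * scale) for v in values))
    # The provenance record ties the output to its input, parameters,
    # and execution environment.
    return {
        "input": {"path": input_path.name, "sha256": sha256_of(input_path)},
        "output": {"path": output_path.name, "sha256": sha256_of(output_path)},
        "parameters": {"scale": scale},
        "python": sys.version.split()[0],
        "platform": platform.system(),
    }


def demo() -> tuple:
    """Run the step twice on the same input and return both records."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        data = workdir / "data.txt"
        data.write_text("1\n2\n3\n")
        first = run_step(data, workdir / "out1.txt", scale=2.0)
        second = run_step(data, workdir / "out2.txt", scale=2.0)
        return first, second


if __name__ == "__main__":
    first, second = demo()
    print(json.dumps(first, indent=2))
    # Identical input and parameters should yield an identical output hash.
    print("repeatable:", first["output"]["sha256"] == second["output"]["sha256"])
```

Comparing the output hashes of two runs is a crude check of repeatability. A full solution would also capture the code version and the software environment, which is where version control and containerization come in.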

  • 0:00 - 0:35 Introduction & Background

    • 0:00 - 0:15 Reproducibility, a philosophy of science perspective
      • Overview of reproducibility issues in different domains of science (Nature 2016 survey results)
      • Definitions of different degrees of reproducibility: reproducibility, replicability, and repeatability
      • The function of reproducibility in the scientific process
    • 0:15 - 0:25 Building blocks for reproducibility: clean code, workflow automation, version control, containerization, provenance tracking
    • 0:25 - 0:35 Survey of the tool landscape: BinderHub, Pachyderm, Beaker, Gigantum, Whole Tale, Singularity Hub, DVC, Stencila, dotscience, amie, Code Ocean, Renku
  • 0:35 - 1:30 Hands-on session with Renku, following

Domains – Statistics, Scientific data flow and persistence, Simulation, Political and Social Sciences, Medicine/Health, Materials Science, Machine Learning, Jupyter, Earth, Ocean and Geo Science, Data Visualisation, Big Data, Astronomy
Project Homepage / Git –
Domain Expertise – none
Python Skill Level – professional
Abstract as a tweet – Come learn about doing Reproducible Data Science in Python at EuroSciPy 2019!