Reproducible Data Science in Python
2019-09-02, Track4 (Chillida)

In this tutorial, we will take a detailed look at the concept of reproducibility, survey the landscape of existing solutions, and do some hands-on work with one solution in particular, Renku.


The expectation of reproducibility in scientific work has been established for several hundred years, and, increasingly, communities and funding sources are actually demanding it. Within the Python ecosystem, there are now a variety of tools available to support reproducible data science, but choosing and using one is not always straightforward. One source of confusion is simply the number of available options. Beyond that, the term "reproducibility" can mean multiple things, making it difficult to compare tools.

In this tutorial, we will examine reproducibility from the perspective of the philosophy of science. That will give us the concepts and vocabulary necessary to precisely understand and discuss different definitions of the term and allow us to identify the technologies that provide the building blocks for reproducible data science. We will briefly survey the landscape of existing solutions and then spend the remaining time looking at one solution in particular, Renku, which we will use to work end-to-end through a reproducible data-science scenario.

  • 0:00 - 0:35 Introduction & Background

    • 0:00 - 0:15 Reproducibility, a philosophy of science perspective
      • Overview of reproducibility issues in different domains of science (Nature 2016 survey results)
      • Definition of different degrees of reproducibility: reproducibility, replicability, and repeatability
      • Examine the function of reproducibility in the scientific process
    • 0:15 - 0:25 Building blocks for reproducibility: clean code, workflow automation, version control, containerization, provenance tracking
    • 0:25 - 0:35 Survey of the Tool Landscape: Binderhub, Pachyderm, Beaker, Gigantum, Whole Tale, SingularityHub, DVC, Stencila, dotscience, amie, CodeOcean, Renku
  • 0:35 - 1:30 Hands-on session with Renku, in which we will develop a typical data-science use case, focusing on the building blocks of reproducibility along the way.
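
As a rough illustration of the provenance-tracking building block listed in the outline above, the sketch below records content hashes of a step's input and output so that a rerun can be checked against the original result. It uses only the Python standard library and is not Renku's API; the names `sha256_of`, `run_step`, `column_mean`, and `demo` are hypothetical, chosen for this example only.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_step(name, func, input_path: Path, output_path: Path) -> dict:
    """Run func(input_path, output_path) and return a provenance record.

    The record is a plain dict (an assumption for this sketch) holding
    the step name plus content hashes of the input and output files.
    """
    input_hash = sha256_of(input_path)
    func(input_path, output_path)
    return {
        "step": name,
        "input": {"path": input_path.name, "sha256": input_hash},
        "output": {"path": output_path.name, "sha256": sha256_of(output_path)},
    }


def column_mean(input_path: Path, output_path: Path) -> None:
    """Toy analysis step: read one number per line and write their mean."""
    values = [float(token) for token in input_path.read_text().split()]
    output_path.write_text(f"{sum(values) / len(values)}\n")


def demo() -> tuple:
    """Run the toy step twice on the same input and return both records."""
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data.txt"
        data.write_bytes(b"1\n2\n3\n")
        result = Path(tmp) / "mean.txt"
        first = run_step("mean", column_mean, data, result)
        second = run_step("mean", column_mean, data, result)
    return first, second


if __name__ == "__main__":
    first, second = demo()
    print(json.dumps(first, indent=2))
    # Matching output hashes mean the rerun reproduced the result exactly.
    print("repeatable:", first["output"]["sha256"] == second["output"]["sha256"])
```

Comparing hashes in this way only checks repeatability on the same machine and environment. The other building blocks (version control, containerization, workflow automation) address where the code and environment come from.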

Requirements and setup instructions

We will run the tutorial on https://renkulab.io, so please register and create an account following these instructions.

To follow along with the slides, go here


Project Homepage / Git:

https://github.com/SwissDataScienceCenter/renku

Abstract as a tweet:

Come learn about doing Reproducible Data Science in Python at EuroSciPy 2019!

Python Skill Level:

basic

Domain Expertise:

none

Domains:

Statistics, Scientific data flow and persistence, Simulation, Political and Social Sciences, Medicine/Health, Materials Science, Machine Learning, Jupyter, Earth, Ocean and Geo Science, Data Visualisation, Big Data, Astronomy

Chandrasekhar studied mathematics at the University of California, Berkeley (B.A. 1997) and art and computer science at the University of California, Santa Barbara (M.A. 2003). He has worked as a software developer and consultant for companies, research institutions, and NGOs in the US, Germany, and Switzerland. Since 2009, he has been at ETH Zürich supporting projects by developing software solutions for data management, analysis, and visualization. In addition to his work at ETH, he teaches data visualization at Propulsion Academy and, as Illposed, works on artistic projects that incorporate data as a central component.