2025-08-19, Small room
The flourishing of open science has created an unprecedented opportunity for scientific discovery through the global exchange of data and collaboration between researchers. DataLad (datalad.org) supports this by providing the tools to develop flexible and decentralized collaborative workflows while upholding scientific rigor. It is free and open source data management software, built on top of the version control systems Git and git-annex. Among its major features are version control for files of any size or type, data transport logistics, and digital process provenance capture for reproducible digital transformations.
In this hands-on workshop, we will start by exploring DataLad’s basic functionality and learn how to run and re-run analyses while versioning and keeping track of your data. Following this, we will explore DataLad’s collaborative features and learn how to install and work with existing datasets and how to share and distribute your work online. After completing this tutorial, you will be equipped to start using DataLad to manage your own research projects and share them with the world.
The tutorial will begin with a short introduction to DataLad that describes typical use cases and explains how DataLad uses the version control systems Git and git-annex to manage, track, and transport data.
After this short introduction, we will start with the first hands-on block, where participants will explore the core concepts of DataLad that enable reproducible research, such as version control and digital provenance. We will learn how to create and configure a DataLad dataset (datalad create), how to add and modify data (datalad [status, save, unlock]), and how to inspect changes in the dataset's history (git log). We will see how DataLad can be used to run a Python script while keeping track of that script's inputs and outputs (datalad run). The record that DataLad produces can then be used to conveniently re-run parts of the analysis pipeline after making changes to the script (datalad rerun). After this hands-on session, participants will understand the basic DataLad functionality required to manage a data analysis project on their local machines.
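The commands below sketch such a session. The dataset name, script, and file paths (my-analysis, code/clean.py, data/raw.csv, data/clean.csv) are placeholders for illustration, not part of the tutorial material:

    $ datalad create -c text2git my-analysis     # new dataset; text files go straight into Git
    $ cd my-analysis
    # ... add code/clean.py and data/raw.csv to the dataset ...
    $ datalad status                             # show untracked and modified files
    $ datalad save -m "Add script and raw data"  # record the current state
    $ datalad run -m "Clean raw data" \
          --input data/raw.csv --output data/clean.csv \
          "python code/clean.py"                 # run the script and capture provenance
    $ git log --oneline                          # inspect the recorded history
    $ datalad rerun HEAD                         # repeat the last recorded command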
Next, we will present how DataLad can be used to install existing datasets and collaborate with others, how it supports the FAIR (Findable, Accessible, Interoperable, and Reusable) principles for data sharing and discovery, and how it integrates with open-science platforms like the Open Science Framework (OSF).
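As an illustration of the OSF integration, a dataset can be published via the datalad-osf extension roughly as sketched below; the project title and sibling name are placeholders, and the exact options depend on the extension version:

    $ pip install datalad-osf                        # extension providing OSF support
    $ datalad create-sibling-osf --title "My Analysis" -s osf
    $ datalad push --to osf                          # upload the dataset content to OSF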
This is followed by another hands-on session, where participants will learn how to install existing online datasets (datalad [clone, get]) and check the identity and availability of files (git-annex [info, whereis]). We will also explore how DataLad can create and manage siblings of a dataset, allowing the user to back up and share their data (datalad [create-sibling-*, push, update]). After this session, participants will have the tools for working with existing datasets provided by open-science platforms and for creating collaborative workflows.
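A rough sketch of this second block; the dataset URL and file path are placeholders, and create-sibling-github stands in for the family of create-sibling-* commands (it requires a GitHub access token):

    $ datalad clone https://example.com/some-dataset.git   # placeholder URL
    $ cd some-dataset
    $ git annex info                        # summarize the annexed content
    $ git annex whereis data/file.dat       # list the remotes that hold this file
    $ datalad get data/file.dat             # retrieve the actual file content
    $ datalad create-sibling-github my-dataset   # create a sibling repository on GitHub
    $ datalad push --to github              # publish the dataset to the sibling
    $ datalad update --merge -s github      # fetch and merge a collaborator's changes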
Participants are expected to bring their own computers. Before the tutorial, we will provide installation instructions and be available for troubleshooting to ensure that every participant is able to follow the exercises. Prior knowledge of Git or git-annex is not required, but participants who are familiar with these tools may gain a deeper understanding of DataLad's inner workings. Familiarity with a Unix-like terminal is also an advantage.
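For reference, installation typically amounts to the following; the exact steps depend on the operating system, and the detailed instructions distributed before the tutorial take precedence over this sketch:

    $ pip install datalad              # DataLad itself (requires Python 3)
    # Git and git-annex must be installed separately, e.g.:
    $ sudo apt-get install git-annex   # Debian/Ubuntu
    $ brew install git-annex           # macOS with Homebrew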
Expected audience expertise (domain): none
Expected audience expertise (Python): none
Supporting material: Project homepage or Git
Your relationship with the presented work/project: Original author or co-author
I studied Biology at the University of Tübingen, where I first learned how to code using Matlab. Then I moved to Leipzig, where I did a master’s degree and later a PhD in neurobiology. In my research, I studied how the brain processes sound location using electroencephalography (EEG) and custom experimental setups for spatial audio. During that time, I started using Python and eventually co-authored “slab”, a Python toolbox for psychoacoustic experiments. After my PhD, I moved to the University of Rochester in New York, where I studied how the brain processes naturalistic speech by modeling EEG that was recorded while participants listened to audiobooks. For this research, I published another toolbox, “mTRFpy”, a Python port of a toolbox originally written in Matlab. As my postdoc was coming to an end, I was looking for a position where I could combine my interest in neuroscience with my passion for programming. I found such a position at the University Clinic Bonn, where I currently work as a research software consultant. In this position, I develop and teach workshops in which neuroscience researchers can improve their software skills. I also do one-on-one consulting to help neuroscientists deal with the computational challenges they face in their research.