2026-08-13, Room 4
Reproducibility in scientific analyses is often hampered by insufficient tooling. Large data and slow computations force users to save partial results to disk, and the actual steps to reproduce the entire chain from raw data to end results are lost. Here we present ReproducibleJobs.jl, a computational framework that enables natural workflows and fast turnaround while still achieving reproducibility, and show how it works in practice for single-cell expression data in SingleCellProjections.jl.
In this talk, we will describe how ReproducibleJobs.jl is structured and show how it is used in SingleCellProjections.jl to achieve reproducible and practical workflows for large single-cell expression data.
ReproducibleJobs.jl is a framework for reproducible analyses of scientific data, based on the following ideas:
* The burden of reproducibility should be moved from the user to the packages they use for analysis, when possible.
* Memoization/caching is a good strategy because, while the raw data can be large, the computed results are typically much smaller.
* It is possible to create succinct specifications of how to perform analyses.
Results in ReproducibleJobs.jl are lazy. Consider a simple but standard single-cell workflow that looks something like this:
counts = load_counts(["paths/to/large/files"])
transformed = sctransform(counts)
normalized = normalize_matrix(transformed)
reduced = pca(normalized; nsv=100)
To actually retrieve the result of the PCA (Principal Component Analysis) computation, the user then calls fetch!(reduced). This is what enables ReproducibleJobs.jl to work under the hood: the lazy result is in fact a specification, a recipe, of what to compute. If the computation was already memoized (cached), even in an earlier Julia session, ReproducibleJobs.jl can load the result directly from disk, without accessing any data from the earlier analysis steps. Importantly, several steps are taken to standardize the specifications, so that only changes that actually affect the results cause recomputations.
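The mechanism can be sketched roughly as follows. Note that `Spec`, `canonical`, `cache_key`, `CACHE_DIR` and this `fetch!` signature are illustrative assumptions for the sketch, not the actual ReproducibleJobs.jl API: a lazy result stores the operation and its inputs, and `fetch!` hashes that recipe to look up a previously serialized result on disk.

```julia
using SHA, Serialization

# Illustrative sketch only: Spec, cache_key, CACHE_DIR and this
# fetch! signature are hypothetical, not the ReproducibleJobs.jl API.

# A lazy result is a recipe: the operation, its inputs (possibly
# other Specs) and the keyword parameters.
struct Spec
    op::Symbol
    inputs::Vector{Any}
    params::NamedTuple
end

# Turn the spec tree into a stable nested-tuple representation...
canonical(s::Spec) = (s.op, map(canonical, s.inputs), s.params)
canonical(x) = x

# ...and derive a content hash from it, used as the cache key.
cache_key(s::Spec) = bytes2hex(sha256(repr(canonical(s))))

const CACHE_DIR = mktempdir()  # a real cache would persist across sessions

# fetch! first checks the on-disk cache; only on a miss does it run
# `compute`, so earlier pipeline steps need not be touched at all.
function fetch!(s::Spec, compute)
    path = joinpath(CACHE_DIR, cache_key(s) * ".jls")
    isfile(path) && return deserialize(path)
    result = compute(s)
    serialize(path, result)
    return result
end
```

Under this scheme, `reduced` from the example above would simply be a `Spec` tree ending in a `:pca` node with `(nsv = 100,)`, and a second `fetch!` in a later session finds the serialized result by its hash without rerunning anything upstream.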
The talk also describes some of the technical challenges and solutions related to:
* Specification design
* Specification metaprogramming/preprocessing - going from "intent" to "implementation"
* Hashing and caching
* Canonical representations of specifications and why they are important
* Strategies for handling high-level data types in specifications, such as tables and DataMatrices (a matrix with variable and observation annotations)
* Stability across Julia sessions and package versions
* How specification metaprogramming enables projections of one dataset onto another in SingleCellProjections.jl
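As a small illustration of the canonical-representation point (the function names and the default-dropping rule below are assumptions for the sketch, not the actual implementation): reordering keyword arguments, or spelling out a parameter at its default value, should not change the hash of a specification, while changing a parameter that affects the result should.

```julia
using SHA

# Hypothetical canonicalization sketch: sort keyword parameters and
# drop those equal to their defaults, so that only changes that can
# affect the result change the hash.
function canonical_params(params::NamedTuple, defaults::NamedTuple)
    kept = [(k, v) for (k, v) in pairs(params) if get(defaults, k, nothing) != v]
    sort!(kept; by = first)
    return kept
end

spec_hash(op::Symbol, params::NamedTuple, defaults::NamedTuple) =
    bytes2hex(sha256(repr((op, canonical_params(params, defaults)))))
```

With this, `spec_hash(:pca, (nsv = 100, center = true), (center = true,))` and `spec_hash(:pca, (center = true, nsv = 100), (center = true,))` coincide, so neither argument order nor an explicitly spelled-out default triggers a recomputation.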
SingleCellProjections.jl is a package for analyzing single-cell expression data in Julia, with support for loading, transforming, normalizing, filtering, dimension reduction, statistical tests and more. Internally, SingleCellProjections.jl uses matrix expressions built from sparse and low-rank matrices, thus avoiding the speed and memory problems that competing (R/Python) packages face when using large dense matrices.
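The matrix-expression idea can be sketched as follows, where `LowRankShift` is a made-up type for illustration, not the internal representation used by SingleCellProjections.jl: a centered matrix A - u*v' is kept factored, and matrix-vector products are pushed through the factors, so the dense matrix is never formed.

```julia
using LinearAlgebra, SparseArrays

# Illustrative only: LowRankShift is a hypothetical type, not the
# internal representation used by SingleCellProjections.jl.
# Represents A - u*v' without materializing the dense matrix.
struct LowRankShift{T}
    A::SparseMatrixCSC{T,Int}  # large sparse matrix (e.g. counts)
    u::Vector{T}               # low-rank factor (e.g. row means)
    v::Vector{T}               # low-rank factor (e.g. all-ones vector)
end

# (A - u*v') * x = A*x - u*(v'x): one sparse product plus a dot
# product, instead of a dense matrix-vector product.
Base.:*(M::LowRankShift, x::AbstractVector) = M.A * x - M.u * dot(M.v, x)
```

Since iterative SVD/PCA solvers only need such matrix-vector products, PCA on a centered matrix can run at sparse cost even though the centered matrix itself is dense.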
Rasmus Henningsson’s research interests center around high-dimensional biological data in general and leukemia in particular. He is currently developing new methods for dimension reduction, analysis and visualization of single-cell expression data. He received his PhD in applied mathematics from Lund University in 2018, working on dimension reduction, viral evolution and leukemia.