Astronomical Data Analysis Software & Systems XXXIV

A Reproducible Science Workflow System: DALiuGE in Action
2024-11-12 , Aula Magna

The DALiuGE science workflow system was introduced to the ADASS audience in 2022. Since then it has evolved into a sophisticated tool for constructing, scheduling and executing arbitrarily complex workflows on anything from a single machine to clusters with thousands of compute nodes. Almost any software package that exposes a Python binding can be introspected automatically, including its in-line documentation and argument types. The extracted components (classes, methods and functions) can then be used to construct workflows in a graphical editor. Unlike most other workflow systems, DALiuGE represents both application and data components as nodes of the workflow graph. This concept is fundamental to its extreme scalability and to the separation of I/O from the algorithms. Data components can reside in memory, even across a compute cluster. Application components can be as complex as full MPI applications or as small as a single-line function call. Along the whole workflow design, scheduling and execution chain, DALiuGE records hash codes of components and data artefacts in a Merkle tree, which enables detailed comparisons of the equivalence of graphs, software components, data artefacts and complete execution runs. Workflows and component descriptions are stored in user-configurable GitHub or GitLab repositories, so they are fully version controlled and can be shared with collaborators or the world. DALiuGE also supports workflows that contain sub-workflows. These sub-workflows can be scheduled and executed at run-time, either on the same platform as the main workflow or somewhere else. When using existing software packages, users don't need to write any code at all and can concentrate fully on the workflow design.
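
To make the Merkle-tree idea concrete, here is a minimal sketch of how hashing application and data nodes could yield a single fingerprint for comparing two graphs. It does not use DALiuGE's own API or schema: the node fields (`name`, `kind`, `params`) and the functions `leaf_hash`, `merkle_root` and `graph_root` are hypothetical names chosen for illustration, and SHA-256 is an assumed hash function.

```python
import copy
import hashlib
import json


def leaf_hash(node: dict) -> str:
    """Hash one node description (hypothetical schema) in a canonical form."""
    canonical = json.dumps(node, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merkle_root(hashes: list[str]) -> str:
    """Combine leaf hashes pairwise until a single root hash remains."""
    if not hashes:
        return hashlib.sha256(b"").hexdigest()
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [
            hashlib.sha256((level[i] + level[i + 1]).encode("ascii")).hexdigest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def graph_root(nodes: list[dict]) -> str:
    """Root hash of a graph; nodes are sorted by name so order is irrelevant."""
    ordered = sorted(nodes, key=lambda n: n["name"])
    return merkle_root([leaf_hash(n) for n in ordered])


# Illustrative workflow: data and application components are both nodes.
workflow = [
    {"name": "raw", "kind": "data", "params": {"storage": "memory"}},
    {"name": "calibrate", "kind": "app", "params": {"threshold": 0.5}},
    {"name": "calibrated", "kind": "data", "params": {"storage": "file"}},
]

# Same graph with one changed parameter.
changed = copy.deepcopy(workflow)
changed[1]["params"]["threshold"] = 0.6

root_a = graph_root(workflow)
root_b = graph_root(changed)
print("equivalent:", root_a == root_b)
```

In this sketch a single changed parameter alters one leaf hash and therefore the root, so two graphs or runs can be compared by their roots alone, and a mismatch can be traced down the tree to the component that differs.
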
The parameterisation of existing, established graphs, whether to run them on different datasets or to re-run them with a slightly changed configuration of individual components, has been streamlined into a single table interface for the entire graph, which exposes a pre-selected set of so-called graph configuration parameters.
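
As a rough illustration of such a table, the sketch below applies one set of overrides to a graph while allowing only pre-selected parameters to change. The graph layout, the `exposed` list, the file names and the function `apply_config` are all hypothetical and do not reflect DALiuGE's actual data model or interface.

```python
import copy

# Hypothetical graph: every node carries parameters, and the graph lists which
# (node, parameter) pairs are exposed as graph configuration parameters.
graph = {
    "nodes": {
        "visibilities": {"kind": "data", "params": {"path": "obs_A.ms"}},
        "imager": {"kind": "app", "params": {"niter": 1000, "cell": "1arcsec"}},
    },
    "exposed": [("visibilities", "path"), ("imager", "niter")],
}


def apply_config(graph: dict, table: dict) -> dict:
    """Return a copy of the graph with the table's overrides applied.

    Only parameters listed in graph["exposed"] may be overridden; anything
    else raises KeyError, and the original graph is left untouched.
    """
    exposed = set(graph["exposed"])
    result = copy.deepcopy(graph)
    for (node, param), value in table.items():
        if (node, param) not in exposed:
            raise KeyError(f"{node}.{param} is not a graph configuration parameter")
        result["nodes"][node]["params"][param] = value
    return result


# One table row per exposed parameter: re-run on another dataset, more iterations.
rerun_table = {("visibilities", "path"): "obs_B.ms", ("imager", "niter"): 5000}
rerun = apply_config(graph, rerun_table)
print(rerun["nodes"]["imager"]["params"])
```

Keeping the overrides in one flat table, separate from the graph itself, means the established graph stays unchanged and each re-run is described entirely by its table.
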
