2022-11-04, ADASS Conference Room 1
The Data Activated Flow Graph Engine (DALiuGE) is a workflow scheduling and execution system. One of its key requirements was extreme scalability, from stand-alone laptops to the biggest supercomputers in the world. A second requirement was the ability to re-use existing (radio astronomy) software while also allowing a tighter integration of dedicated processing components with the execution engine. The third main requirement was separation of concerns: astronomers and workflow developers concentrate on the workflow logic and the selection of appropriate algorithmic components; software engineers on the development and maintenance of operational-grade components providing those algorithms; and HPC specialists on optimizing the execution of the components and of the overall workflow. This separation is also meant to enable hardware/software co-design for performance-critical or I/O-critical components. DALiuGE has since reached beta status and is ready for broader use. An earlier version drove the 2020 Gordon Bell Prize finalist project, whose goal was to demonstrate that SKA-scale data streams can be processed on a supercomputer the size of Summit. That run used 99 percent of Summit (27,360 GPUs) and achieved a peak single-precision performance of 130 petaflops, a data generation rate of 247 gigabytes per second, and a pure I/O rate of 925 gigabytes per second.
This paper presents an overview of the system and covers the key aspects of how each of the requirements mentioned above is implemented.