GOOD 2026

Domain-Aware Workflow Monitoring in Open OnDemand
2026-03-11, Breakout Room

Researchers rely on Open OnDemand (OOD) to design, launch, and manage scientific workflows on high-performance computing (HPC) systems. Beyond job creation and submission, this process includes monitoring job progress. OOD provides the job-listing app and session cards for monitoring running jobs. However, these tools primarily expose scheduler-level information, forcing researchers to retrieve workflow- and domain-specific information manually. Drona Workflow Engine bridges this gap by providing integrated, domain-specific monitoring throughout the workflow lifecycle. In this presentation, we focus on job-monitoring capabilities and explore the types of job monitoring integrated into representative, commonly used scientific workflows.


Open OnDemand is the go-to platform for researchers running scientific workflows (structured collections of interdependent computational steps used to carry out a scientific experiment or analysis) on HPC systems. Running a scientific workflow typically involves two primary steps: batch job creation and submission, followed by job monitoring. Open OnDemand supports these activities through the job composer and project manager applications for job submission, and through the job-listing app for monitoring job progress. Although useful, the information provided by the job-listing app is largely limited to batch-scheduler metadata, and session cards offer only minimal additional context. As a result, researchers must often fall back on command-line tools to understand workflow progress, performance, and domain-specific behavior.

Effective job monitoring requires visibility into both system-level metrics, such as CPU and GPU utilization, and domain-specific progress indicators, including epochs completed in AI workflows, convergence metrics for numerical solvers, and workflow-specific output patterns. For machine-learning workflows built with PyTorch or TensorFlow, seamless integration with tools such as TensorBoard is a key requirement. Similarly, for AlphaFold workflows, exposing domain-specific pipeline stages such as MSA generation, model inference, and structure relaxation provides meaningful insight.

To address this gap, Drona Workflow Engine provides a framework for detailed, domain-specific batch script creation, submission, and job monitoring, allowing researchers to focus on scientific insight rather than manual tracking. In this presentation, we demonstrate these capabilities using a representative scientific workflow fully integrated with Drona.
The demonstration presents rich observational views, including CPU and GPU utilization and standard Slurm job information. It also supports interactive control actions, such as canceling portions of a workflow or requesting additional walltime. Finally, it exposes domain-specific workflow signals, including completed steps and runtime warnings.
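
As a rough illustration of what a "domain-specific workflow signal" can look like in practice, the sketch below scans job output for training-epoch progress and runtime warnings. It is not Drona's implementation: the log format (`Epoch 3/10`, lines containing `WARNING`) and the names `Progress` and `summarize_log` are assumptions made up for this example, and real workflows print progress in many different ways.

```python
import re
from dataclasses import dataclass, field

# Hypothetical log patterns, assumed for illustration only.
EPOCH_RE = re.compile(r"Epoch\s+(\d+)\s*/\s*(\d+)")
WARNING_RE = re.compile(r"\bWARNING\b", re.IGNORECASE)


@dataclass
class Progress:
    """Domain-level summary extracted from a job's output."""
    last_epoch: int = 0          # highest epoch number seen in the log
    total_epochs: int = 0        # total epochs announced by the log
    warnings: list = field(default_factory=list)

    @property
    def fraction(self) -> float:
        """Share of epochs reached so far (0.0 if the total is unknown)."""
        if self.total_epochs == 0:
            return 0.0
        return self.last_epoch / self.total_epochs


def summarize_log(lines) -> Progress:
    """Scan log lines and collect epoch progress and warning messages."""
    progress = Progress()
    for line in lines:
        match = EPOCH_RE.search(line)
        if match:
            progress.last_epoch = max(progress.last_epoch, int(match.group(1)))
            progress.total_epochs = int(match.group(2))
        if WARNING_RE.search(line):
            progress.warnings.append(line.strip())
    return progress


if __name__ == "__main__":
    sample = [
        "Epoch 1/10",
        "loss: 0.91",
        "WARNING: learning rate is high",
        "Epoch 2/10",
    ]
    summary = summarize_log(sample)
    print(f"epoch {summary.last_epoch}/{summary.total_epochs}, "
          f"{len(summary.warnings)} warning(s)")
```

A monitoring view could run a function like this over a job's output file at intervals and display the result alongside the scheduler-level information; the same approach would extend to other markers, such as pipeline stage names, given a known log format.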