2024-11-11, Aula Magna
The data volume produced by astronomical experiments continues to grow with each new generation of instrumentation. This is particularly true for heterodyne receivers transitioning from single-pixel to multi-pixel arrays (hundreds of pixels), as we are doing with the CCAT Heterodyne Array Instrument (CHAI) at the upcoming CCAT observatory. Previous-generation receivers, like GREAT aboard SOFIA, with 21 pixels, generated up to 50-70 GB of data per flight. While challenging, these data volumes could still be reduced and analyzed on local computers using traditional methods, such as the GILDAS software from IRAM. However, CHAI is expected to produce a peak data rate of 8 TB per day. This volume crosses the threshold where traditional single-computer pipelines are insufficient, necessitating a migration to an automated high-performance computing (HPC) environment.
CHAI is one of two instruments at the CCAT observatory. The other instrument, Prime-Cam, a modular receiver with up to seven science modules, will yield a similar data rate. To manage these large data volumes from the CCAT observatory, we are developing the CCAT Data Center at the University of Cologne.
In this presentation, I will discuss the limitations of our traditional in-house single-dish heterodyne data reduction pipelines, such as those used at SOFIA and the NANTEN2 telescope, and how these limitations hinder migration to a distributed, fully automated computational environment. I will also present our approach for the CCAT Data Center to overcome these challenges. Specifically, we are transitioning to a Python-based pipeline optimized for distributed computing and HPC environments, reusing existing solutions where possible. By employing a central database to track data from planning through observation, data transfer, reduction, and analysis, and by using a workflow management system to orchestrate the data reduction process, we aim to minimize manual interaction and increase efficiency.
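As a rough illustration of the central-database idea, the sketch below tracks each observation through the lifecycle stages named above (planning, observation, data transfer, reduction, analysis). It is not the CCAT Data Center implementation: the table layout, the function names, the `obs-001` identifier, and the use of an in-memory SQLite database are all assumptions made for this example.

```python
import sqlite3

# Lifecycle stages taken from the abstract; everything else here is hypothetical.
STAGES = ["planning", "observation", "transfer", "reduction", "analysis"]


def create_tracker():
    """Open an in-memory database with one row per observation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations (obs_id TEXT PRIMARY KEY, stage TEXT NOT NULL)"
    )
    return conn


def register(conn, obs_id):
    """Add a new observation at the first lifecycle stage."""
    conn.execute("INSERT INTO observations VALUES (?, ?)", (obs_id, STAGES[0]))
    conn.commit()


def current_stage(conn, obs_id):
    """Return the stage an observation is currently in."""
    row = conn.execute(
        "SELECT stage FROM observations WHERE obs_id = ?", (obs_id,)
    ).fetchone()
    if row is None:
        raise KeyError(obs_id)
    return row[0]


def advance(conn, obs_id):
    """Move an observation to the next stage and return that stage."""
    index = STAGES.index(current_stage(conn, obs_id))
    if index == len(STAGES) - 1:
        raise ValueError(f"{obs_id} is already at the final stage")
    next_stage = STAGES[index + 1]
    conn.execute(
        "UPDATE observations SET stage = ? WHERE obs_id = ?", (next_stage, obs_id)
    )
    conn.commit()
    return next_stage


def pending(conn, stage):
    """List the observations waiting in a given stage, sorted by identifier."""
    rows = conn.execute(
        "SELECT obs_id FROM observations WHERE stage = ? ORDER BY obs_id", (stage,)
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = create_tracker()
    register(conn, "obs-001")
    advance(conn, "obs-001")
    print(current_stage(conn, "obs-001"))
```

In a setup like this, a workflow manager could poll `pending(conn, "reduction")` to decide which observations to process next, so that no manual bookkeeping is needed between stages.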
However, implementing these solutions is not without challenges. One significant challenge is that existing solutions from other groups often meet 90% of our needs on paper, but the specifics of our data formats and processing requirements frequently prevent easy integration or native use. My hope is that by sharing our experiences, we can foster discussions with other groups to make our solutions more general and to learn from our respective experiences.