Streamlining the Cosmos: Pythonic Workflow Management for Astronomical Analysis PyCon DE & PyData 2025

Streamlining the Cosmos: Pythonic Workflow Management for Astronomical Analysis
.ical

2025-04-23 17:50–18:20, Palladium

Astronomical surveys are growing rapidly in complexity and scale, necessitating accurate, efficient, and reproducible reduction and analysis pipelines. In this talk we explore Pythonic workflow managers to streamline processing large datasets on distributed computing environments.

Modern astronomy generates vast datasets across the electromagnetic spectrum. NASA's flagship James Webb Space Telescope (JWST) provides unprecedented observations that enable deep studies of distant galaxies, cosmic structures, and other astrophysical phenomena. However, these datasets are complex and require intricate calibration and analysis pipelines to transform raw data into meaningful scientific insights.

We will discuss the development and deployment of Pythonic tools, including snakemake and pixi, to construct modular, parallelized workflows for data reduction and analysis. Attendees will learn how these tools automate complex processing steps, optimize performance in distributed computing environments, and ensure reproducibility. Using real-world examples, we will illustrate how these workflows simplify the journey from raw data to actionable scientific insights.

As astronomical surveys continue to grow in size and sophistication, researchers face mounting challenges in building efficient, scalable, and reproducible data processing pipelines. Modern observatories, like NASA's James Webb Space Telescope (JWST), are delivering unprecedented volumes of complex and specialized data, requiring innovative approaches to transform raw observations into meaningful, scientifically valid results. This talk focuses on leveraging Pythonic workflow management tools to address the unique challenges of processing large-scale astronomical datasets efficiently and reproducibly.

I will provide a brief overview of JWST including its capabilities and the groundbreaking science it has enabled. In particular we will focus on the Pure Parallel mode which collect serendipitous observations from regions of the sky adjacent to primary science targets. These opportunistic datasets are a powerful resource for blind extragalactic surveys, offering unique opportunities to uncover faint galaxies, cosmic structures, and rare astrophysical phenomena. However, their “unscheduled” and heterogeneous nature presents significant challenges: the data arrive in raw, uncalibrated formats and require intricate, multi-step workflows—such as artifact masking, background subtraction, and galaxy spectral analysis—before becoming scientifically usable.

In this talk, I will demonstrate how tools like Snakemake and Pixi offer powerful, Pythonic solutions for these challenges. I’ll show how these tools allow scientists to design modular, scalable, and highly parallel workflows that automate the reduction and analysis process while efficiently distributing computation across high-performance computing (HPC) clusters and cloud environments. By breaking workflows into smaller, reusable components, we can improve computational performance and maintain flexibility to adapt pipelines to new datasets, instruments, or evolving scientific goals.

Reproducibility remains a critical pillar of modern science, and I will highlight how combining workflow managers with environment management tools ensures version-controlled pipelines, transparent data lineage tracking, and reliable replication of results. This enables consistent analyses across diverse systems, fostering collaboration and long-term usability of scientific products.

This talk is designed for (data) scientists and researchers working with large-scale or complex datasets with multi-step reduction/analysis pipelines. This talk will provide a GitHub repository containing resources for building modular workflows, including examples of existing infrastructures, to provide actionable takeaways. Attendees will leave with a clear understanding of how to apply modern workflow management techniques to streamline their processing pipelines, improve reproducibility, and scale their analyses to meet the demands of their datasets.

Expected audience expertise: Domain:

Novice

Expected audience expertise: Python:

Intermediate

Raphael Hviding

Hello! I am Raphael Hviding, a Postdoctoral Researcher in the Data Science Department at the Max-Planck Institute for Astronomy.
Scientifically I am interested in studying the lives of galaxies beyond our own Milky Way, how they formed, and the role that supermassive black holes play in governing galaxy evolution.
Computationally my skills are in workflow management for complex data processing pipelines and in data-driven Bayesian modelling of astronomical data.

Streamlining the Cosmos: Pythonic Workflow Management for Astronomical Analysis .ical 2025-04-23 17:50–18:20, Palladium

Streamlining the Cosmos: Pythonic Workflow Management for Astronomical Analysis
.ical

2025-04-23 17:50–18:20, Palladium