Processing Cloud-optimized data in Python (Dataplug) EuroSciPy 2025

Processing Cloud-optimized data in Python (Dataplug)
.ical

2025-08-21 14:05–14:35, Room 1.19 (Ground Floor)

The elasticity of the Cloud is very appealing for processing large scientific data. However, enormous volumes of unstructured research data, totaling petabytes, remain untapped in data repositories due to the lack of efficient parallel data access. Even-sized partitioning of these data to enable its parallel processing requires a complete re-write to storage, becoming prohibitively expensive for high volumes. In this article we present Dataplug, an extensible framework that enables fine-grained parallel data access to unstructured scientific data in object storage. Dataplug employs read-only, format-aware indexing, allowing to define dynamically-sized partitions using various partitioning strategies. This approach avoids writing the partitioned dataset back to storage, enabling distributed workers to fetch data partitions on-the-fly directly from large data blobs, efficiently leveraging the high bandwidth capability of object storage. Validations on genomic (FASTQGZip) and geospatial (LiDAR) data formats demonstrate that Dataplug considerably lowers pre-processing compute costs (between 65.5% — 71.31% less) without imposing significant overheads.

Cloud-optimized (CO) data formats are designed to efficiently store and access data directly from cloud object storage without needing to download the entire dataset.
These formats enable faster data retrieval, scalability, and cost-effectiveness by allowing users to fetch only the necessary subsets of data.

They also allow for efficient parallel data processing using on-the-fly partitioning, which can considerably accelerate data management operations. For example, Dask can efficiently read data in parallel from Object Storage in CO formats like ZARR.

Cloud-optimized formats are now widely used in geospatial settings with entire datasets available in the AWS Registry for Open Data like Sentinel-2 Cloud Optimized GeoTIFFs. In this line, COPC (Cloud Optimized Point Cloud) was developed to overcome the limitations of LIDAR. Likewise, Cloud Optimized GeoTIFF (COG) was developed to facilitate cloud processing of GeoTIFF files.

Nevertheless, there are no cloud optimized versions of widely used formats in genomics (FASTA, FASTQ, VCF, FASTQGZIP) and metabolomics (imzML). Furthermore, a costly preprocessing from legacy formats is required (from GeoTIFF to COG, from LIDAR to COPC). In this talk, we will present a novel data processing library called Dataplug that enables Cloud-optimized access to legacy formats without a costly preprocessing and also avoiding huge data movements. Dataplug covers legacy formats like LIDAR but also major data formats found in bioinformatics (genomics, metabolomics) that lack appropriate Cloud Optimized alternatives.

In this talk, you will learn how to process scientific data formats in Python using the Dataplug library from any Python data analytics platform like Dask or Ray. We will show different data processing pipelines in the Cloud that demonstrate the benefits of cloud-optimized data management.

Objectives

By the end of this talk, you will be able to:

Understand Cloud-Optimized data formats and their benefits for data processing in the Cloud
Learn how to process Cloud Optimized Data from Object Storage in Python using Dask
Use Dataplug library to enable on-the-fly partitioning of Cloud Optimized data (COG, ZARR, COPC).
Use Dataplug library to enable on-the-fly partitioning of non-Cloud Optimized formats (LIDAR, FASTQGZIP, FASTA, FASTQ, VCF,imzML, MS)

Outline

Introduction (10 minutes)

About Us
Understanding Cloud-Optimized data formats and Cloud Object storage
Processing Cloud-Optimized data in Dask

Processing Cloud-optimized data in the Cloud with Python (15 minutes)

Processing COG (Cloud-Optimized GeoTIFFs) in Python in the NDVI pipeline
On-the-fly processing of compressed genomic data (FASTQGZIP) with Dataplug
On-the-fly processing of metabolomics data (imzML) with Dataplug
Commparing LIDAR and COPC processing with Dataplug library in Dask (code)

Conclusions (2 minutes)

Audience

The talk is aimed at Python developers interested in processing data in the Cloud. In particular, it may be of interest in the following domains:
geospatial data (COG, COPC, LIDAR, ZARR, Kerchunk), genomics data (FASTA, FASTQ, VCF, FASTQGZIP), astronomics (MS), and metabolomics data (imzML).
This talk requires basic understanding of Cloud Object Storage.

Expected audience expertise: Domain:

some

Expected audience expertise: Python:

some

Project homepage or Git: Project homepage or Git Your relationship with the presented work/project:

Original author or co-author

Universitat Rovira i Virgili (Pedro Garcia Lopez)

Pedro Garcia is professor of the Computer Engineering and Mathematics Department at the University Rovira i Virgili (Spain). He leads he “Cloud and Distributed Systems Lab” research group and coordinates large research european projects. In particular, he leads CloudStars (2023-2027), NearData (2023-2025), CloudSkin (2023-2025), and he participates as partner in EXTRACT (2023-2025). He also coordinated FP7 CloudSpaces (2013-1015), H2020 IOStack (2015-2017) and H2020 CloudButton (2019-2022).

During 2019-2020 he worked as visiting scientist in IBM Watson Research in the Hybrid Clouds group focused on serverless technologies. His research topics are distributed systems, cloud computing, data analytics, software architectures and middleware. He has published more than 100 papers on journals and prestigious conferences (ACM Middleware, IEEE ICDCS, USENIX FAST, ICDE, IMC). He has participated in scientific committees of different conferences like Middleware, CCGRID, CloudCom, CIC, P2P, CLOSER, or WETICE among others. He is currenlty co-organizing the International Workshop on Serverless Computing (WoSC).

Daniel Alejandro Coll Tejeda

Daniel Alejandro Coll Tejeda is a dedicated researcher of Cloud and Distributed Systems Lab at the University Rovira i Virgili (URV), Tarragona. Specializing in cloud computing, his current research involves intensive data analysis and the creation of sophisticated tools designed to optimize the management of diverse cloud infrastructures, encompassing Kubernetes orchestration, serverless computing paradigms (such as AWS Lambda and Lithops), and virtual machine environments.

Processing Cloud-optimized data in Python (Dataplug) .ical 2025-08-21 14:05–14:35, Room 1.19 (Ground Floor)

Objectives

Outline

Audience

Processing Cloud-optimized data in Python (Dataplug)
.ical

2025-08-21 14:05–14:35, Room 1.19 (Ground Floor)