Universitat Rovira i Virgili (Pedro Garcia Lopez) EuroSciPy 2025


Pedro Garcia is a professor in the Computer Engineering and Mathematics Department at the University Rovira i Virgili (Spain). He leads the “Cloud and Distributed Systems Lab” research group and coordinates large European research projects. In particular, he leads CloudStars (2023-2027), NearData (2023-2025), and CloudSkin (2023-2025), and he participates as a partner in EXTRACT (2023-2025). He also coordinated FP7 CloudSpaces (2013-2015), H2020 IOStack (2015-2017) and H2020 CloudButton (2019-2022).

During 2019-2020 he worked as a visiting scientist at IBM Watson Research in the Hybrid Clouds group, focusing on serverless technologies. His research topics are distributed systems, cloud computing, data analytics, software architectures and middleware. He has published more than 100 papers in journals and prestigious conferences (ACM Middleware, IEEE ICDCS, USENIX FAST, ICDE, IMC). He has served on the scientific committees of conferences such as Middleware, CCGRID, CloudCom, CIC, P2P, CLOSER, and WETICE, among others. He is currently co-organizing the International Workshop on Serverless Computing (WoSC).

Affiliation:

Universitat Rovira i Virgili

Position / Job:

Full Professor

Homepage

X handle:

@pedrotgn


Session

08-21

14:05

30min

Processing Cloud-optimized data in Python (Dataplug)

Universitat Rovira i Virgili (Pedro Garcia Lopez), Daniel Alejandro Coll Tejeda

The elasticity of the Cloud is very appealing for processing large scientific data. However, enormous volumes of unstructured research data, totaling petabytes, remain untapped in data repositories due to the lack of efficient parallel data access. Partitioning these data into even-sized chunks to enable parallel processing requires a complete re-write to storage, which becomes prohibitively expensive at high volumes. In this article we present Dataplug, an extensible framework that enables fine-grained parallel data access to unstructured scientific data in object storage. Dataplug employs read-only, format-aware indexing, which allows users to define dynamically-sized partitions using various partitioning strategies. This approach avoids writing the partitioned dataset back to storage: distributed workers fetch data partitions on-the-fly directly from large data blobs, efficiently leveraging the high bandwidth of object storage. Validations on genomic (FASTQGZip) and geospatial (LiDAR) data formats demonstrate that Dataplug considerably lowers pre-processing compute costs (between 65.5% and 71.31% less) without imposing significant overheads.
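
The general idea described above (index once without rewriting, then derive partitions as byte ranges that workers fetch directly) can be illustrated with a small sketch. This is not Dataplug's API: the function names `build_record_index`, `plan_partitions` and `range_get` are hypothetical, the blob is an in-memory `bytes` object standing in for an object in storage, and the data is uncompressed FASTQ-like text (four lines per record) rather than the gzip-compressed format evaluated in the talk.

```python
import io
from bisect import bisect_left


def build_record_index(blob: bytes, lines_per_record: int = 4) -> list[int]:
    """Single read-only pass: collect the byte offset where each record starts."""
    offsets = []
    pos = 0
    for i, line in enumerate(io.BytesIO(blob)):
        if i % lines_per_record == 0:
            offsets.append(pos)
        pos += len(line)
    return offsets


def plan_partitions(index: list[int], blob_size: int, num_partitions: int):
    """Derive (start, end) byte ranges of roughly equal size.

    Each cut is moved forward to the next record boundary found in the index,
    so no record is split across two partitions. Nothing is written back.
    """
    target = blob_size / num_partitions
    cuts = [0]
    for k in range(1, num_partitions):
        j = bisect_left(index, k * target)
        cut = index[j] if j < len(index) else blob_size
        if cut > cuts[-1]:
            cuts.append(cut)
    if cuts[-1] < blob_size:
        cuts.append(blob_size)
    return list(zip(cuts[:-1], cuts[1:]))


def range_get(blob: bytes, start: int, end: int) -> bytes:
    """Hypothetical stand-in for a ranged GET against object storage."""
    return blob[start:end]


# Toy FASTQ-like blob: 8 records of 4 lines each (assumed sample data).
blob = b"".join(f"@r{i}\nACGT\n+\nIIII\n".encode() for i in range(8))
index = build_record_index(blob)
ranges = plan_partitions(index, len(blob), num_partitions=3)
chunks = [range_get(blob, start, end) for start, end in ranges]
```

In this sketch the index is the only extra artifact produced; changing `num_partitions` just yields a different list of byte ranges over the same unmodified blob, which is the property that avoids re-writing the dataset.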

Computational Tools and Scientific Python Infrastructure

Room 1.19 (Ground Floor)
