Daniel Alejandro Coll Tejeda
Daniel Alejandro Coll Tejeda is a researcher at the Cloud and Distributed Systems Lab at the University Rovira i Virgili (URV), Tarragona. He specializes in cloud computing, and his current research involves intensive data analysis and building tools to optimize the management of diverse cloud infrastructures, including Kubernetes orchestration, serverless computing (such as AWS Lambda and Lithops), and virtual machine environments.
CloudLab
Researcher
Session
The elasticity of the Cloud is very appealing for processing large scientific data. However, petabytes of unstructured research data remain untapped in data repositories for lack of efficient parallel data access. Partitioning these data into even-sized chunks to enable parallel processing requires rewriting the entire dataset to storage, which becomes prohibitively expensive at high volumes. In this article we present Dataplug, an extensible framework that enables fine-grained parallel data access to unstructured scientific data in object storage. Dataplug employs read-only, format-aware indexing, which allows dynamically sized partitions to be defined using various partitioning strategies. This approach avoids writing the partitioned dataset back to storage: distributed workers fetch data partitions on the fly directly from large data blobs, efficiently leveraging the high bandwidth of object storage. Validations on genomic (FASTQGZip) and geospatial (LiDAR) data formats demonstrate that Dataplug considerably lowers pre-processing compute costs (between 65.5% and 71.31% less) without imposing significant overheads.
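The idea of read-only indexing followed by on-the-fly partitioning can be illustrated with a minimal sketch. This is not Dataplug's API: the function names `build_index`, `plan_partitions`, and `range_get` are hypothetical, an in-memory `bytes` object stands in for a blob in object storage, and newline-delimited records stand in for a real format-aware index such as one for FASTQ or LiDAR data.

```python
def build_index(blob: bytes) -> list[int]:
    """Scan the blob once, read-only, and record the byte offset of every record start."""
    offsets = [0] if blob else []
    pos = blob.find(b"\n")
    while pos != -1 and pos + 1 < len(blob):
        offsets.append(pos + 1)          # the next record begins right after the newline
        pos = blob.find(b"\n", pos + 1)
    return offsets


def plan_partitions(offsets: list[int], size: int, n: int) -> list[tuple[int, int]]:
    """Derive up to n byte ranges whose boundaries always fall on record starts."""
    if not offsets:
        return []
    n = max(1, min(n, len(offsets)))
    cuts = [offsets[(i * len(offsets)) // n] for i in range(n)] + [size]
    return [(cuts[i], cuts[i + 1]) for i in range(n)]


def range_get(blob: bytes, start: int, end: int) -> bytes:
    """Stand-in for a byte-range read against object storage (assumption: end is exclusive)."""
    return blob[start:end]


# Hypothetical dataset: ten newline-terminated records in a single blob.
blob = b"".join(f"record-{i}\n".encode() for i in range(10))
index = build_index(blob)

# The same index supports different partition counts without rewriting the blob.
for workers in (3, 4):
    parts = plan_partitions(index, len(blob), workers)
    chunks = [range_get(blob, start, end) for start, end in parts]
    print(workers, [len(chunk) for chunk in chunks])
```

Because partitions are only byte ranges computed from the index, changing the number of workers means recomputing ranges rather than rewriting data, which is the property the abstract attributes to the read-only approach.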