ADASS 2022

Anna Anku


Session

11-03
19:00
30min
GPUs and multiclustering for big data computing

Space missions produce enormous amounts of data, which at some point have to be cleaned, processed, transformed, and passed through pipelines. Later on, the data are stored and analysed in one way or another. Multiplying that amount by the number of missions shows that not only the tools but also the architecture must support the immense volume of information.

These sets of mission data appear well suited to analysis with various libraries and algorithms, but they raise a question: how can the size be handled so that the running time of the operations does not drift into worst-case scenarios? One way to accelerate computation on big data is with Graphics Processing Units (GPUs), which build on the idea of SIMD (single instruction, multiple data): a single operation repeated on multiple data points. While the user sees the immediate benefit in the speed at which the result is obtained, behind the scenes this depends on effective use of threads, memory, and concurrent access to resources.
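
As a rough illustration of "a single operation repeated on multiple data points", the sketch below applies one calibration step to every value in a small list. It only mimics the data-parallel pattern in plain Python on the CPU; on a GPU, each element would typically be handled by its own thread. The names `apply_to_all` and `calibrate`, and the bias and gain values, are assumptions made up for this example.

```python
from typing import Callable, List, Sequence


def apply_to_all(op: Callable[[float], float], data: Sequence[float]) -> List[float]:
    """Apply the same operation to every data point independently.

    No element depends on any other, which is what allows a GPU to
    process many of them at the same time.
    """
    return [op(x) for x in data]


def calibrate(raw_count: float) -> float:
    """Hypothetical calibration: subtract a bias, then apply a gain."""
    bias = 100.0  # assumed value, for illustration only
    gain = 2.0    # assumed value, for illustration only
    return (raw_count - bias) * gain


pixels = [100.0, 150.0, 225.5]
calibrated = apply_to_all(calibrate, pixels)
print(calibrated)
```

Because each output depends only on its own input, the work can be split across as many threads as there are data points, which is where the speed-up on a GPU comes from.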

The other crucial aspect to consider is sharing the data with the community. Most of the time, moving or copying data across the Internet is both complex and time-consuming, so a good solution is to bring the user to the data. This is in essence how ESA Datalabs works: the platform brings the user and their code to the information and offers custom tools for handling astronomical data, as well as more general-purpose ones (like Jupyter Notebooks). ESA Datalabs is built on Kubernetes clusters. This approach allows independence from a particular operating system with minimal virtualisation overhead. Managed properly, the clusters offer persistence and, most importantly, scalability: if the users need more resources or the platform has to scale, this can be handled by adding new clusters, for example one with a GPU.
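
The idea of scaling by adding a cluster can be sketched with a toy placement model. This is not how ESA Datalabs or Kubernetes schedule workloads; the `Cluster` class, the `place` function, and the cluster names are hypothetical and serve only to show why a GPU request becomes satisfiable once a GPU cluster joins the pool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    """Hypothetical view of a cluster: a name and its free resources."""
    name: str
    free_cpus: int
    free_gpus: int = 0


def place(clusters: List[Cluster], cpus: int, gpus: int = 0) -> Optional[Cluster]:
    """Return the first cluster that can host the request, or None.

    The chosen cluster's free resources are reduced by the request.
    """
    for cluster in clusters:
        if cluster.free_cpus >= cpus and cluster.free_gpus >= gpus:
            cluster.free_cpus -= cpus
            cluster.free_gpus -= gpus
            return cluster
    return None


# A single CPU-only cluster cannot host a GPU workload.
clusters = [Cluster("cpu-cluster", free_cpus=8)]
first = place(clusters, cpus=4, gpus=1)

# Adding a GPU cluster to the pool makes the same request placeable.
clusters.append(Cluster("gpu-cluster", free_cpus=16, free_gpus=2))
second = place(clusters, cpus=4, gpus=1)

print(first, second)
```

In this model the existing cluster is left untouched; capacity grows by extending the pool rather than by resizing what is already running.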

This presentation introduces the concepts of GPU computing and multiclustering, shows how a single-cluster architecture can be expanded, and covers the considerations to take into account when integrating GPUs into these big data processing environments.

ADASS Conference Room 1