2020-07-31, Green Track
Studies in immunology, developmental biology, and medicine use flow and mass cytometry to generate huge amounts of single-cell data. GigaSOM.jl is a high-performance, horizontally scalable implementation of the clustering and visualization algorithms commonly used in cytometry, designed to handle datasets of sizes inaccessible to currently available tools. We show the structure and design of GigaSOM.jl, and demonstrate the results on recent datasets from a massive immunophenotyping effort.
GigaSOM is an implementation of Kohonen's self-organizing maps (SOM) algorithm that facilitates clustering and dimensionality reduction of huge datasets, counting billions of individual data points with tens of dimensions. Its development, showcased at JuliaCon 2019, is motivated by the needs of flow and mass cytometry data analysis, which is relevant in immunology, developmental biology, and clinical medicine: individual cells from the measurements need to be precisely categorized (which is currently best done with self-organizing maps, as devised by van Gassen et al. (2015)), and eventually evaluated and visualized.
GigaSOM performs this kind of computation on large compute clusters, which allows the analysis to scale horizontally. We will describe a Julia toolkit for map-reduce-style computation and data distribution in common HPC environments, which we developed for the purposes of GigaSOM. The toolkit cooperates with the Distributed package and works well with common cluster software, e.g. Slurm. With that in hand, we demonstrate a high-level implementation of SOMs and related algorithms (e.g. EmbedSOM (Kratochvíl et al., 2019)) that scale horizontally, show performance measurements, and demonstrate the results achievable on several datasets, including data from the International Mouse Phenotyping Consortium (Brown & Moore, 2012). Notably, our testing showed that 1 billion data points can be processed within minutes on relatively common computer clusters or cloud compute grids, which vastly expands the possibilities of large-scale data analysis.
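To illustrate why SOM training suits map-reduce-style computation, the following is a minimal sketch of one common formulation, the batch SOM update, written in plain Python for brevity. It is not the GigaSOM.jl code or API: all function names (`map_chunk`, `reduce_stats`, `update_codebook`, `train`), the Gaussian neighborhood, and the toy data are assumptions made for this example. Each chunk of data is summarized independently into per-node sums and counts (the map step), the summaries are added together (the reduce step), and only these small summaries are needed to update the map.

```python
import functools
import math


def nearest_node(point, codebook):
    """Index of the codebook vector closest to `point` (squared Euclidean)."""
    best, best_d = 0, float("inf")
    for i, w in enumerate(codebook):
        d = sum((p - q) ** 2 for p, q in zip(point, w))
        if d < best_d:
            best, best_d = i, d
    return best


def map_chunk(chunk, codebook):
    """Map step: per-node coordinate sums and point counts for one chunk."""
    dim = len(codebook[0])
    sums = [[0.0] * dim for _ in codebook]
    counts = [0] * len(codebook)
    for point in chunk:
        i = nearest_node(point, codebook)
        counts[i] += 1
        for k in range(dim):
            sums[i][k] += point[k]
    return sums, counts


def reduce_stats(a, b):
    """Reduce step: element-wise addition of two (sums, counts) pairs."""
    sums = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts


def update_codebook(codebook, grid, stats, radius):
    """Batch update: neighborhood-weighted mean of the reduced statistics."""
    sums, counts = stats
    dim = len(codebook[0])
    new = []
    for j, old in enumerate(codebook):
        num, den = [0.0] * dim, 0.0
        for i in range(len(codebook)):
            # Gaussian neighborhood on the map grid (an assumed choice).
            d2 = sum((a - b) ** 2 for a, b in zip(grid[i], grid[j]))
            h = math.exp(-d2 / (2.0 * radius ** 2))
            den += h * counts[i]
            for k in range(dim):
                num[k] += h * sums[i][k]
        new.append([v / den for v in num] if den > 0 else list(old))
    return new


def train(chunks, codebook, grid, radii):
    """One epoch per radius; in a cluster, each chunk would live on a worker."""
    for radius in radii:
        stats = functools.reduce(
            reduce_stats, (map_chunk(c, codebook) for c in chunks)
        )
        codebook = update_codebook(codebook, grid, stats, radius)
    return codebook


# Toy data: two chunks standing in for data held by two workers.
grid = [(0.0, 0.0), (1.0, 0.0)]
chunks = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 10.0], [10.0, 12.0]]]
trained = train(chunks, [[1.0, 1.0], [9.0, 9.0]], grid, radii=[1.0, 0.1])
```

In this toy run the two map nodes settle close to the means of the two groups of points. The property that matters for horizontal scaling is that the reduced statistics do not depend on how the data is split into chunks, so the raw data never has to leave the worker that holds it.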
The quality of the software package is assured using ARTENOLIS (https://artenolis.lcsb.uni.lu) (Heirendt et al., 2017). Biological validation of the results is performed by comparison with conventional implementations of the FlowSOM package and with manual analysis.
PhD student at Charles University in Prague, Department of Software Engineering. Assistant researcher at the Institute of Organic Chemistry and Biochemistry in Prague. Working on various computationally intensive problems in bioinformatics and cheminformatics.