Gabriel Andres Jaimes Illanes
- JAE Intro ICU Scholar - Institute of Astrophysics of Andalusia (IAA).
- MSc. Astrophysics, University of La Laguna (ULL) in agreement with the Institute of Astrophysics of the Canary Islands (IAC). (Carolina Foundation Scholar)
- MEng. Electronic Science and Technology, Beijing University of Aeronautics and Astronautics (BUAA - Beihang) (Chinese Scholarship Council Scholar).
- National Astronomy Education Coordinator (NAEC) for Bolivia - Office of Astronomy for Education (OAE) / International Astronomical Union (IAU)
- Project lead: Bolivian Virtual Observatory (BVO), Astro3DBol - Educational kits.
Session
Hydrogen is the most abundant element in the universe, making it essential to the formation and evolution of galaxies. The 21 cm radio line of neutral atomic hydrogen (HI) maps the distribution and dynamics of gas within galaxies. Emission in this spectral line is an important tracer for studying galaxy interactions and for understanding galactic structure, star formation processes and the general behavior of the interstellar medium. Machine Learning (ML) and Big Data tools can improve the quality and efficiency of scientific analysis in this field, especially when it involves large radio astronomy databases and spectrum classification.
Within this context, our work proposes a framework for classifying HI spectral profiles using ML techniques. Several methodologies integrating unsupervised ML techniques and Convolutional Neural Networks (CNN) have been implemented. We have focused on HI datasets used by the AMIGA (Analysis of the interstellar Medium in Isolated GAlaxies) research group, with a sample of 318 CIG (Catalog of Isolated Galaxies) spectral profiles and 30780 profiles from the ALFALFA (Arecibo Legacy Fast ALFA) survey.
To design this classification framework, the first step was data preprocessing, using the Busyfit package (Westmeier et al., 2014) for HI spectrum profile fitting. A second data set was generated using iterative fitting with polynomial, Gaussian, and double-Lorentzian models. This approach also involved a multi-faceted strategy for profile clustering based on temporal shapelet transformation for feature detection, with several clustering algorithms (K-means, spectral clustering, DBSCAN, and agglomerative clustering, among others) used as a bootstrap for feature extraction. Furthermore, we considered a series of classification techniques, including K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Random Forest classifiers. To optimize the performance of these models, a CNN model was also tested, with an in-depth evaluation of how various model configurations affect classification accuracy.
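As an illustration of the classification step only, the following is a minimal sketch of K-Nearest Neighbors applied to toy HI-like profiles. It is not the pipeline used in this work: the synthetic profile shapes, the labels "single" and "double", and the helper names `make_profile` and `knn_predict` are all assumptions introduced for the example, and only the Python standard library is used.

```python
import math
from collections import Counter


def gaussian(x, mu, sigma):
    """Unnormalized Gaussian evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def make_profile(kind, shift=0.0, n=64):
    """Toy profile on n channels (hypothetical shapes, not real HI data).

    'single' is one Gaussian; any other value gives a double-peaked profile.
    The result is scaled so that its peak equals 1.
    """
    xs = [i - n / 2 + shift for i in range(n)]
    if kind == "single":
        flux = [gaussian(x, 0.0, 6.0) for x in xs]
    else:
        flux = [gaussian(x, -10.0, 4.0) + gaussian(x, 10.0, 4.0) for x in xs]
    peak = max(flux)
    return [f / peak for f in flux]


def knn_predict(train, query, k=3):
    """Majority vote among the k training profiles closest to the query.

    Distance is the Euclidean distance between the channel vectors.
    """
    dists = sorted((math.dist(profile, query), label) for profile, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Small labeled training set: each shape at three velocity offsets.
train = [
    (make_profile(kind, shift), kind)
    for kind in ("single", "double")
    for shift in (-2.0, 0.0, 2.0)
]

print(knn_predict(train, make_profile("single", shift=1.0)))
print(knn_predict(train, make_profile("double", shift=-1.0)))
```

In practice the distance would more likely be computed on extracted features (for example shapelet-based ones) than on raw channels, and a library implementation would replace the hand-written vote; the sketch only shows the principle.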
The second part of this work focuses on adding a dimension to the profiles in order to improve the classification. This 2D analysis applies CNN techniques to determine the degree of asymmetry by classifying the sample of CIG galaxies. Three distinct 2D image models were generated for the symmetry study: the first is a rotation of the fitted spectrum; the second rotates the spectrum after subtracting its right and left profiles to accentuate asymmetry features; and the third is a normalized version of the previous image, with pixel intensity adjusted to further emphasize specific image features. We explain the methodology with current ML techniques and discuss the extrapolation to the ALFALFA survey. The resulting classification was compared with a profile classification previously made by the AMIGA scientific group (Espada, 2011).
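The 1D-to-2D transformation can be sketched under one possible reading of "rotation", which is an assumption here rather than the documented procedure: the profile is folded about its central channel, and the folded curve is swept around the image centre so that each pixel takes the value at its radius. The helper names `asymmetry_residual`, `rotate_profile` and `normalize` are hypothetical, and the example covers the second and third image models only.

```python
import math


def asymmetry_residual(flux):
    """Absolute difference between the left half (mirrored) and the right half.

    Both halves are read outward from the central channel; for an odd number
    of channels the central one is skipped. A symmetric profile gives zeros.
    """
    n = len(flux)
    mid = n // 2
    left = flux[:mid][::-1]   # channels walking outward to the left
    right = flux[n - mid:]    # channels walking outward to the right
    return [abs(a - b) for a, b in zip(left, right)]


def rotate_profile(half, size):
    """Sweep a 1D half-profile around the centre of a size x size image.

    Each pixel takes the value of the channel nearest to its radius;
    pixels beyond the last channel are set to 0.
    """
    c = (size - 1) / 2
    image = []
    for i in range(size):
        row = []
        for j in range(size):
            r = int(round(math.hypot(i - c, j - c)))
            row.append(half[r] if r < len(half) else 0.0)
        image.append(row)
    return image


def normalize(image):
    """Scale pixel intensities so the brightest pixel equals 1."""
    peak = max(max(row) for row in image)
    if peak == 0:
        return image
    return [[v / peak for v in row] for row in image]


# Toy double-horned profile with unequal horns (made-up values).
lopsided = [0.0, 1.0, 5.0, 2.0, 2.0, 3.0, 1.0, 0.0]

residual = asymmetry_residual(lopsided)      # second model: left/right difference
image = rotate_profile(residual, size=9)     # swept into a 2D image
image_norm = normalize(image)                # third model: normalized intensities

for row in image_norm:
    print(" ".join(f"{v:.1f}" for v in row))
```

Under the same assumption, the first image model could be obtained by passing a folded version of the fitted profile itself to `rotate_profile` instead of the residual. A perfectly symmetric profile produces an all-zero residual image, so any structure in the second and third images reflects asymmetry.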
The study presents the application of ML techniques for classifying HI profiles, including an approach that transforms HI profiles into 2D images to classify profile asymmetries, with the goal of improving the accuracy and depth of future analyses. We also intend to build and verify a minimal methodology that could potentially be applied to the ongoing Square Kilometre Array (SKA) precursor surveys such as MeerKAT (MIGHTEE HI) or Apertif, where the number of detections will be higher, thus laying the foundation for a full-scope methodology in the SKA era. All material, code and models have been produced following the FAIR principles and published in an open-access public repository.