Machine learning bias and the annotation of large databases of astronomical objects
While autonomous digital sky surveys have been revolutionizing the field of astronomy, they also introduce new challenges in the processing and analysis of the massive databases that they generate. One common approach to annotating large databases of astronomical objects is machine learning, and specifically artificial neural networks. Neural networks have demonstrated high efficacy in assigning correct annotations to astronomical objects, and can automate the analysis of databases that are far too large to be annotated manually. But while artificial neural networks can be an invaluable tool for astronomical data analysis, they also have several downsides. Here we study and profile the possible disadvantages of artificial neural networks in the context of astronomical data analysis. The study shows that annotations produced by artificial neural networks can have subtle but consistent biases. These biases are very difficult to detect, can vary across different parts of the sky, and are not intuitive to the consumers of data products annotated by machine learning and deep neural networks. Because these catalogs are in many cases very large, even subtle biases can lead to statistically significant observations that are the result of the neural network bias rather than a true reflection of the Universe. Based on these observations, catalogs annotated by current artificial neural networks should be used cautiously, and statistical observations enabled by such catalogs should be analyzed in light of possible biases in the machine learning systems. The results reinforce the need for further research on explainable neural network architectures applied to astronomical data analysis.