07-12, 16:20–16:30 (Europe/Amsterdam), Method (1.5)
We introduce Cairn.jl, a Julia library of state-of-the-art active learning algorithms for training machine learning interatomic potentials for molecular dynamics simulation. This package provides a unifying Julia-based platform for rapid comparison and prototyping of active learning schemes, including a novel technique based on kernel Stein discrepancy for data querying and labeling.
Background. Machine learning methods have made computational materials characterization and discovery substantially more tractable in atomistic simulation. Machine learning interatomic potentials (ML-IPs) are surrogate models that estimate electronic structure properties, such as the potential energy and forces of local atomic environments. Trained on a dataset of quantum mechanical (QM) calculations of energy and forces, ML-IPs have the potential to achieve quantum-level accuracy at a fraction of the cost, enabling molecular dynamics (MD) simulation on scales at which macroscopically observable chemical and physical properties can be quantified.
The central challenge in machine learning model development is the tight dependence of model performance on the reference data chosen for training. Building a reference dataset for ML-IPs is non-trivial: it involves either manually assembling data based on expert heuristics or sampling atomic configurations from MD simulations, which often exhibit metastable dynamics and fail to converge to the canonical probability distribution [1]. In response, several active learning algorithms for ML-IPs have been introduced, where the objective is to improve model performance as efficiently as possible by iteratively acquiring training data and labeling it with expensive QM calculations as an oracle. As noted in [2], comparing these methods is difficult due to “inhomogeneous [...] notation schemes” and implementations in different programming languages and platforms.
Contribution. We introduce Cairn.jl, a Julia library of active learning algorithms for training ML-IPs. These techniques assemble diverse datasets of atomic configurations, representing a wide range of energetic states, to train ML-IPs capable of characterizing chemical and physical processes of interest. The active learning strategy in Cairn.jl is an alternating-stage algorithm. The first stage centers on data acquisition through sampling by MD simulation. Users may choose between existing simulators, such as velocity Verlet or Langevin dynamics, and “enhanced” simulators implemented in Cairn.jl, in which a biasing force term in the equation of motion guides the simulation path toward underrepresented regions of the free energy landscape [3,4,5]. The samplers include:
* Query-by-committee methods [3]
* Uncertainty-driven dynamics [4,5]
* Stein repulsive dynamics [7,8]
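To illustrate how a biasing force can steer a sampler toward poorly learned regions, here is a minimal one-dimensional sketch in Python. It is not Cairn.jl code and does not reproduce the bias used in [3,4,5]. The toy committee of harmonic models, the saturating bias energy, and all function and parameter names (`bias_height`, `bias_width`, `simulate`, and so on) are assumptions made for illustration, and the integrator is a plain Euler-Maruyama step for overdamped Langevin dynamics with unit mobility.

```python
import math
import random

# Hypothetical committee: harmonic wells with slightly different stiffnesses.
# The members agree at x = 0 and disagree more as |x| grows.
STIFFNESSES = (0.8, 1.0, 1.2)


def committee_energies(x):
    """Energy predicted by each (toy) committee member at position x."""
    return [0.5 * k * x * x for k in STIFFNESSES]


def mean_energy(x):
    energies = committee_energies(x)
    return sum(energies) / len(energies)


def energy_variance(x):
    """Committee disagreement, used here as the uncertainty estimate."""
    energies = committee_energies(x)
    mean = sum(energies) / len(energies)
    return sum((e - mean) ** 2 for e in energies) / len(energies)


def bias_energy(x, bias_height, bias_width):
    """Assumed saturating bias: 0 where the committee agrees, tending to
    -bias_height where the disagreement is large."""
    return bias_height * (math.exp(-energy_variance(x) / bias_width ** 2) - 1.0)


def mean_force(x, h=1e-5):
    """Force from the committee-mean energy (central finite difference)."""
    return -(mean_energy(x + h) - mean_energy(x - h)) / (2.0 * h)


def bias_force(x, bias_height, bias_width, h=1e-5):
    """Force from the bias energy; it points toward higher uncertainty."""
    upper = bias_energy(x + h, bias_height, bias_width)
    lower = bias_energy(x - h, bias_height, bias_width)
    return -(upper - lower) / (2.0 * h)


def simulate(x0, n_steps, dt=0.01, temperature=0.1,
             bias_height=1.0, bias_width=0.1, seed=0):
    """Overdamped Langevin trajectory under the mean force plus the bias."""
    rng = random.Random(seed)
    x = x0
    trajectory = [x]
    for _ in range(n_steps):
        force = mean_force(x) + bias_force(x, bias_height, bias_width)
        noise = math.sqrt(2.0 * temperature * dt) * rng.gauss(0.0, 1.0)
        x = x + dt * force + noise
        trajectory.append(x)
    return trajectory
```

Setting `bias_height=0.0` recovers the unbiased sampler, so the two can be compared from the same starting point. In this toy setup the biased trajectory is held away from the well minimum, where the committee members disagree, while the unbiased one relaxes back toward it.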
The second stage centers on data labeling, in which a criterion calculated during MD simulation triggers retraining of the ML-IP. An uncertainty or diversity metric selects a subset of samples from the MD trajectory, which are labeled by calls to the oracle and added to the training set. The retraining criteria include:
* Thresholds using ensemble-based or GP-based estimates of uncertainty [3,4,5,6]
* Thresholds using similarity kernel evaluations (kernel Stein discrepancy) [7]
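As a rough illustration of a kernel-based trigger, the sketch below computes a squared kernel Stein discrepancy (KSD) between a set of one-dimensional samples and a target distribution known only through its score function, then compares it to a threshold. This is not Cairn.jl code, and how Cairn.jl defines its retraining criterion is not described here. The RBF kernel, the V-statistic estimator, the threshold value, and the names `ksd_squared` and `should_retrain` are all assumptions for illustration.

```python
import math
import random


def stein_kernel(x, y, score, bandwidth):
    """Stein kernel u_p(x, y) built from an RBF kernel in one dimension."""
    d = x - y
    h2 = bandwidth ** 2
    k = math.exp(-d * d / (2.0 * h2))
    dk_dx = -d / h2 * k                       # derivative of k w.r.t. x
    dk_dy = d / h2 * k                        # derivative of k w.r.t. y
    d2k_dxdy = (1.0 / h2 - d * d / (h2 * h2)) * k
    sx, sy = score(x), score(y)
    return sx * sy * k + sx * dk_dy + sy * dk_dx + d2k_dxdy


def ksd_squared(samples, score, bandwidth=1.0):
    """V-statistic estimate of the squared KSD between samples and target."""
    n = len(samples)
    total = 0.0
    for x in samples:
        for y in samples:
            total += stein_kernel(x, y, score, bandwidth)
    return total / (n * n)


def should_retrain(samples, score, threshold, bandwidth=1.0):
    """Hypothetical trigger: retrain once the samples drift from the target."""
    return ksd_squared(samples, score, bandwidth) > threshold


def standard_normal_score(x):
    """Score (gradient of the log-density) of a standard normal target."""
    return -x


if __name__ == "__main__":
    rng = random.Random(0)
    matched = [rng.gauss(0.0, 1.0) for _ in range(200)]
    shifted = [x + 3.0 for x in matched]
    for name, data in (("matched", matched), ("shifted", shifted)):
        value = ksd_squared(data, standard_normal_score)
        print(name, value, should_retrain(data, standard_normal_score, 0.5))
```

For a Boltzmann target the score would be the force divided by the temperature, so a check of this kind needs only force evaluations from the current model and no normalizing constant. The double loop costs O(n²) kernel evaluations, which is fine for a sketch but would need subsampling or batching on long trajectories.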
We show that ML-IPs trained using iteration of this active learning scheme achieve greater accuracy with significant gains in efficiency compared to those trained with brute force methods of data acquisition and labeling.
Research impact. This work provides one of the first platforms to unify recent active learning algorithms for ML-IPs. Moreover, it contains the first implementation of a novel active learning scheme derived from Stein variational gradient descent, to be detailed in a future article. The package offers the option to run MD simulation on multiple software platforms, including the widely used LAMMPS and the Julia-native Molly.jl. Cairn.jl will enable easy comparison of the techniques, so that users may quickly prototype active learning routines and choose the algorithm best suited to their problem.
Relevance to the Julia community. Cairn.jl fills a key gap in the growing Julia ecosystem for molecular simulation. It is part of the software suite of CESMIX (Center for Exascale Simulation of Materials in Extreme Environments) and interfaces with packages such as PotentialLearning.jl, InteratomicPotentials.jl, Molly.jl, and ACEpotentials.jl. Moreover, the package serves as a starting point for general-purpose active learning algorithms for machine learning model development that rely on MCMC-type sampling of probability distributions, a topic of broader interest to the Julia scientific community.
References.
[1] Comer (2015). J of Phys Chem B. doi: 10.1021/jp506633n.
[2] Henin (2022). LiveCoMS. doi: 10.33011/livecoms.4.1.1583.
[3] Smith (2018). J of Chem Phys. doi: 10.1063/1.5023802.
[4] Kulichenko (2023). Nat Comp Sci. doi: 10.1038/s43588-023-00406-5.
[5] van der Oord (2023). npj Comp Mat. doi: 10.1038/s41524-023-01104-6.
[6] Vandermause (2020). npj Comp Mat. doi: 10.1038/s41524-020-0283-z.
[7] Liu (2016). NeurIPS. url: http://arxiv.org/abs/1608.04471.
[8] Ye (2020). NeurIPS. url: http://arxiv.org/abs/2002.09070.
PhD Student in Computational Science & Engineering at MIT