JuliaCon Local Paris 2025

New Features of the Beta Machine Learning Toolkit (BetaML): Missing Value Imputation, Autoencoders, and Variable Importance Metrics
02/10/2025, Amphithéâtre Robert Faure
Language: English

BetaML.jl is a lightweight, pure-Julia machine learning library with a scikit-learn-like, consistent API, designed for usability over complexity. It supports decision trees, neural networks, clustering algorithms, and essential ML workflows, with heuristics that simplify usage and a one-parameter autotuning system. This talk highlights its new features: missing value imputation, non-linear dimensionality reduction via autoencoders, and variable importance metrics.


In the contemporary landscape of machine learning (ML), there is an increasing demand for frameworks that prioritize accessibility and usability, particularly in educational settings, rapid prototyping, and research workflows. BetaML.jl responds to this demand with a lightweight, pure Julia library that offers a consistent and intuitive interface, inspired by the design principles of scikit-learn. The library emphasizes ease of use and low entry barriers, while still supporting a rich set of ML models and utilities tailored to both novice and advanced users.

At the core of BetaML.jl is a scikit-learn-like API centered on a small set of consistent functions: model instantiation through Model(), training with fit!, prediction using predict, and, where applicable, reconstruction of input data via inverse_predict. This uniform interface fosters code clarity and supports a shallow learning curve, particularly for users transitioning from Python-based ML workflows. Importantly, the library adopts a usability-first philosophy: hyperparameters come with meaningful defaults, many are automatically inferred through heuristic strategies, and a unified one-parameter autotuning mechanism enables efficient model calibration with minimal user intervention.
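This workflow can be sketched as follows (a minimal illustration; model and hyperparameter names are taken from the BetaML documentation, and exact defaults should be checked against the installed version):

```julia
# Sketch of the BetaML construct / fit! / predict workflow.
using BetaML

X = [1.0 10; 2 20; 3 30; 4 40; 5 50; 6 60]   # 6 records, 2 features
y = ["a", "a", "a", "b", "b", "b"]           # class labels

m = RandomForestEstimator(n_trees=30)   # instantiate with an explicit hyperparameter
fit!(m, X, y)                           # train the model in place
ŷ = mode(predict(m, X))                 # predict returns class scores; mode picks the top class

# One-parameter autotuning: same model, hyperparameters calibrated automatically
m_auto = RandomForestEstimator(autotune=true)
```

The same four verbs (construction, `fit!`, `predict`, and where applicable `inverse_predict`) apply unchanged across the library's models.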

BetaML.jl provides implementations for both supervised and unsupervised learning. In the domain of supervised learning, the library includes decision trees and random forests—offering interpretable, non-parametric models suitable for both classification and regression—as well as customizable feedforward neural networks. These are implemented in pure Julia and are designed to handle small to medium-sized datasets efficiently. For unsupervised learning, BetaML.jl includes clustering algorithms such as K-Means and K-Medoids for hard partitioning, alongside Gaussian Mixture Models (GMMs) for probabilistic soft clustering and density estimation. All models conform to the same API paradigm and can be easily integrated into broader workflows.
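As a sketch of the unsupervised side of this API (again using model names from the BetaML documentation, on toy data rather than the talk's demo):

```julia
# Hard and soft clustering under the same fit!-based interface.
using BetaML

X = [1.0 1.0; 1.1 0.9; 9.0 9.0; 9.1 8.9]   # two obvious groups

km = KMeansClusterer(n_classes=2)
classes = fit!(km, X)        # for unsupervised models, fit! returns the assignments

gm = GaussianMixtureClusterer(n_classes=2)
probs = fit!(gm, X)          # probabilistic (soft) class memberships per record
```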

In addition to the core models, BetaML.jl incorporates an extensive suite of utilities that facilitate the end-to-end machine learning pipeline. These include data preprocessing tools such as scalers and encoders, sampling methods for validation (including k-fold cross-validation), and a variety of loss functions and performance metrics. These components are designed to work seamlessly with the predictive models, enhancing reproducibility and reducing boilerplate code.
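A sketch of how these utilities combine with a model (the `Scaler`, `KFold`, and `cross_validation` names follow the BetaML documentation; the do-block pattern is the documented idiom, though signatures should be verified against the installed version):

```julia
# Scaling plus k-fold cross-validation of a regression model.
using BetaML

X = rand(30, 3)
y = X * [2.0, -1.0, 0.5] .+ rand(30) .* 0.1

Xs = fit!(Scaler(), X)   # for transformers, fit! returns the transformed data

# The do-block receives the train/validation split for each fold
(μ, σ) = cross_validation([Xs, y], KFold(nsplits=5)) do trainData, valData, rng
    (xtr, ytr) = trainData
    (xva, yva) = valData
    m = DecisionTreeEstimator()
    fit!(m, xtr, ytr)
    return relative_mean_error(yva, predict(m, xva))
end
```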

This presentation will also highlight a set of newly introduced features that significantly extend the library’s functionality. First, missing value imputation is now natively supported through a suite of imputation models. These include simple heuristics such as mean and median imputation, as well as more sophisticated approaches based on Gaussian mixture models, random forests, or general-purpose regressors. The imputation framework supports iterative estimation, multiple imputations, and integration with external models, offering robustness in the face of incomplete data.
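A minimal sketch of the imputation workflow, assuming the imputer names and the `recursive_passages` hyperparameter as given in the BetaML documentation:

```julia
# Model-based imputation of missing values: fit! returns the completed matrix.
using BetaML

X = [1.0 10.5; 1.5 missing; 1.8 8.0; 1.7 15.0; 3.2 40.0; missing 38.0]

imp = RandomForestImputer(recursive_passages=2)   # iterate the estimation twice
X_full = fit!(imp, X)
# Simpler or model-based alternatives follow the same calls, e.g.
# SimpleImputer() or GaussianMixtureImputer(n_classes=2).
```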

Second, BetaML.jl introduces autoencoders as part of its non-linear dimensionality reduction toolkit. Autoencoders are implemented as deep neural networks capable of learning compressed, information-preserving representations of high-dimensional data. This functionality is particularly relevant for feature extraction, denoising, and visualization of complex data structures. In addition, a PCA-based linear encoder is available, providing users with both linear and non-linear dimensionality reduction options within the same framework.
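In code, a sketch of both reduction options (the `encoded_size` and `epochs` hyperparameters follow the BetaML documentation; the data here is purely illustrative):

```julia
# Non-linear (autoencoder) and linear (PCA) dimensionality reduction, same API.
using BetaML

X = vcat(rand(50, 5), rand(50, 5) .+ 2)   # 100 records, 5 features, two clouds

ae = AutoEncoder(encoded_size=2, epochs=100)
X_enc = fit!(ae, X)                  # compressed 2-column representation
X_rec = inverse_predict(ae, X_enc)   # approximate reconstruction of the input

pca = PCAEncoder(encoded_size=2)     # linear alternative within the same framework
X_pca = fit!(pca, X)
```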

A third major addition is the provision of variable importance metrics, aimed at improving model interpretability. The new FeatureRanker utility offers importance scoring based on both mean decrease in accuracy and variance decomposition (Sobol indices), allowing users to quantify the contribution of each input variable to model performance. It supports multiple ranking strategies, including permutation-based scoring, "permute and relearn", and, when the estimator supports it, "fit once and omit in prediction". FeatureRanker is compatible with any estimator that provides a fit/predict API, not only those from BetaML.
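A sketch of FeatureRanker in use, on synthetic data where only the first column carries signal (constructor keywords as per the BetaML documentation; treat them as assumptions to be checked against the installed version):

```julia
# Ranking input variables by their contribution to predictive performance.
using BetaML

X = rand(100, 3)
y = 2 .* X[:, 1] .+ 0.1 .* rand(100)   # only column 1 is informative

fr = FeatureRanker(model=RandomForestEstimator(), nsplits=3, nrepeats=2)
rank = fit!(fr, X, y)   # column indices ordered from least to most important
```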

To conclude the session, the talk will include a demonstration of a complete ML pipeline using BetaML.jl. This will involve loading and preprocessing a dataset, training a model with automatic hyperparameter tuning, applying autoencoders for dimensionality reduction, and interpreting model predictions through variable importance analysis. The goal is to provide attendees with a practical understanding of how BetaML.jl can be used to construct efficient and reproducible ML workflows entirely within the Julia ecosystem.

In sum, BetaML.jl offers a coherent, user-oriented design that balances simplicity and power. It serves both as an educational platform and a research-grade tool for applied machine learning in Julia, with particular attention to model transparency, ease of deployment, and extensibility.

Antonello Lobianco, PhD, is a research engineer employed by a French grande école (polytechnic university). He works on the biophysical and economic modeling of the forest sector and is responsible for the lab's portfolio of models. He programs in C++, Perl, PHP, Visual Basic, Python, and Julia. He teaches environmental and forest economics at the undergraduate and graduate levels and modeling at the PhD level. For several years, Antonello has been following the development of Julia, as it fits his modeling needs. He is the author of a few Julia packages, particularly on data analysis and machine learning.