2020-07-29 –, Red Track
AutoMLPipeline (AMLP) is a package that makes it trivial to create
complex ML pipeline structures using simple
expressions. AMLP leverages the built-in
macro programming features of Julia
to symbolically process and manipulate
pipeline expressions, which
makes it easier to discover optimal structures
for machine learning prediction and classification.
The typical workflow in machine learning
classification or prediction requires
some or a combination of the following
preprocessing steps together with modeling:
- feature extraction (e.g. ica, pca, svd)
- feature transformation (e.g. normalization, scaling, ohe)
- feature selection (e.g. anova, correlation)
- modeling (e.g. rf, adaboost, xgboost, lm, svm, mlp)
Each step has several choices of functions
to use together with their corresponding
parameters. Optimizing the performance of the
entire pipeline is a combinatorial search
over the order and combination of preprocessing
steps and their corresponding
parameters, together with a search for
the optimal model and its hyper-parameters.
Because of the close dependencies among the various
steps, we can consider the entire process
to be a pipeline optimization problem (POP).
POP requires simultaneous optimization of pipeline
structure and parameter adaptation of its elements.
As a consequence, having an elegant way to
express pipeline structure helps in the analysis
and implementation of the optimization routines.
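To give a feel for how quickly the search space grows, here is a minimal sketch in plain Julia that enumerates candidate structures. It does not use AMLP; the component names are only labels taken from the list above, and the sketch assumes exactly one component per stage in a fixed order, which is a simplification of the full problem (no reordering, no parameter search).

```julia
# Hypothetical candidate components per stage; the symbols are labels only.
extractors   = [:pca, :ica, :svd]
transformers = [:normalize, :scale, :ohe]
selectors    = [:anova, :correlation]
models       = [:rf, :adaboost, :xgboost]

# Every way of picking one component per stage, in a fixed stage order.
candidates = vec(collect(Iterators.product(extractors, transformers, selectors, models)))

# Render each candidate as a pipeline-style string for inspection.
exprs = [join(string.(c), " |> ") for c in candidates]

println(length(candidates))  # 3 * 3 * 2 * 3 = 54 structures
println(first(exprs))
```

Even this restricted version yields 54 structures before any parameter is tuned; allowing reordering, branching, or per-component parameters multiplies that count further.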
Package Features
- Pipeline API that allows high-level description of processing workflow
- Common API wrappers for ML libs including ScikitLearn, DecisionTree, etc.
- Symbolic pipeline parsing for easy expression
of complex pipeline structures
- Easily extensible architecture by overloading just two main interfaces: fit! and transform!
- Meta-ensembles that allow composition of
ensembles of ensembles (recursively if needed)
for robust prediction routines
- Categorical and numerical feature selectors for
specialized preprocessing routines based on types
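The two-interface idea can be illustrated with a small standalone sketch. This is not AMLP code: the `Step` abstract type and the `Center` element are hypothetical, and the method signatures are an assumption made for illustration rather than the package's actual ones. The point is only that a new element needs a `fit!` method to learn its state and a `transform!` method to apply it.

```julia
using Statistics

# Hypothetical stand-in for a pipeline element; not an AMLP type.
abstract type Step end

# A toy element that centers each column on its training mean.
mutable struct Center <: Step
    means::Vector{Float64}
    Center() = new(Float64[])
end

# Learn the column means from the training matrix (y is unused here).
function fit!(s::Center, X::AbstractMatrix, y=nothing)
    s.means = vec(mean(X, dims=1))
    return s
end

# Subtract the learned means from each column; returns a new matrix.
function transform!(s::Center, X::AbstractMatrix)
    return X .- s.means'
end

X  = [1.0 2.0; 3.0 6.0]
c  = fit!(Center(), X)
Xc = transform!(c, X)
println(Xc)
```

Any other element, whether a scaler, an encoder, or a model, would follow the same pattern by adding its own pair of methods.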
To illustrate, consider a typical machine learning workflow that extracts
numerical features (numf) for ICA (independent component analysis) and
PCA (principal component analysis) transformations, concatenates the
results with the one-hot encoding (ohe) of the categorical
features (catf) of a given dataset, and feeds the combined features to a
random forest (RF) model. This workflow can be expressed
in AMLP as:
julia> model = @pipeline (catf |> ohe) + (numf |> pca) + (numf |> ica) |> rf
julia> fit!(model,Xtrain,Ytrain)
julia> prediction = transform!(model,Xtest)
julia> score(:accuracy,prediction,Ytest)
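As a rough illustration of what symbolic processing means here, the sketch below quotes the same expression and inspects it as an ordinary Julia `Expr` tree, using only base Julia. The `leaves` helper is hypothetical and written for this example; it is not how AMLP's `@pipeline` macro is implemented, only a view of the kind of structure a macro receives.

```julia
# Quote the pipeline expression without evaluating it.
ex = :((catf |> ohe) + (numf |> pca) + (numf |> ica) |> rf)

# Hypothetical helper: collect the symbols at the leaves of the call tree.
leaves(x::Symbol) = [x]
leaves(x::Expr) = x.head == :call ? reduce(vcat, leaves.(x.args[2:end])) : Symbol[]
leaves(x) = Symbol[]

# `+` binds tighter than `|>`, so the outermost call is the final pipe into rf.
println(ex.args[1])          # outermost operator
println(ex.args[2].args[1])  # operator joining the three branches
println(leaves(ex))          # pipeline elements in left-to-right order
```

Because the whole pipeline is available as a tree of operators and symbols, a macro can rewrite or rearrange it before any model is trained.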
Paulito Palmes is a Research Scientist at IBM Research (Dublin Research Lab). He works on the following research themes: AutoAI/AutoML, Interactive AI, Explainable AI, AI Planning, and ML Pipeline optimization.