Merging machine learning and econometric algorithms to improve feature selection with Julia
2019-07-23, 15:00–15:10, Room 349

Building on our previous contributions to JuliaCon 2018 (see GlobalSearchRegression.jl, GlobalSearchRegressionGUI.jl, and our JuliaCon 2018 Lightning Talk), we develop a new version of GlobalSearchRegression.jl that merges LASSO and QR-OLS algorithms and adds new output capabilities. Combining machine learning (ML) and econometric (EC) procedures allows us to deal with a much larger set of potential covariates (e.g. from 30 to hundreds) while preserving most of the original advantages of all-subset regression approaches (in-sample and out-of-sample optimality, model averaging results, and residual tests for coefficient robustness). Additionally, the new version of GlobalSearchRegression.jl allows users to obtain LaTeX and PDF output with best model results, model averaging estimates, and the distributions of key statistics.

Applied scientific research increasingly uses Fat Data (i.e. a large number of explanatory variables relative to the number of observations) for feature selection purposes. The previous version of our all-subset-regression Julia package was unable to deal with such datasets. Existing ML packages (e.g. Lasso.jl) overcome this problem at a cost in terms of statistical inference, coefficient robustness, and feature selection optimality (because ML algorithms focus on prediction, not on explanation or causal prediction). The new GlobalSearchRegression.jl version combines regularization pre-processing with all-subset-regression algorithms to work efficiently with Fat Data without losing EC strengths in terms of sensitivity analysis, residual properties, and coefficient robustness.
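As a rough illustration of the two-stage idea described above, the following self-contained sketch first uses a LASSO fit to discard most covariates and then runs an exhaustive all-subset OLS search over the survivors. It is not the package's implementation or API: it is written in plain Python with no dependencies, and the function names (`lasso_support`, `best_subset`, `ols_rss`, `solve`), the synthetic data, the penalty value `lam=0.4`, and the choice of BIC as the ranking criterion are all assumptions made for the example. For brevity, the OLS step solves the normal equations by Gaussian elimination instead of the QR decomposition mentioned in the abstract.

```python
import itertools
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols_rss(X, y, cols):
    """Fit OLS with an intercept on the given columns; return the residual sum of squares."""
    rows = [[1.0] + [row[j] for j in cols] for row in X]
    k = len(cols) + 1
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def lasso_support(X, y, lam, sweeps=200):
    """Stage 1: coordinate-descent LASSO on centered data.

    Minimizes (1/2n)*||y - Xb||^2 + lam*||b||_1 and returns the indices
    of the covariates whose coefficients are nonzero.
    """
    n, p = len(X), len(X[0])
    ybar = sum(y) / n
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    resid = [yi - ybar for yi in y]
    scale = [sum(row[j] ** 2 for row in Xc) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Covariance of column j with the partial residual (column j added back).
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + scale[j] * beta[j]
            # Soft-thresholding update.
            new = math.copysign(max(abs(rho) - lam, 0.0), rho) / scale[j]
            if new != beta[j]:
                delta = new - beta[j]
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                beta[j] = new
    return [j for j in range(p) if beta[j] != 0.0]


def best_subset(X, y, candidates):
    """Stage 2: exhaustive OLS over all non-empty subsets of the candidates, ranked by BIC."""
    n = len(y)
    best = None
    for size in range(1, len(candidates) + 1):
        for cols in itertools.combinations(candidates, size):
            rss = ols_rss(X, y, cols)
            bic = n * math.log(rss / n) + (size + 1) * math.log(n)
            if best is None or bic < best[0]:
                best = (bic, cols)
    return best


# Synthetic data: 60 observations, 40 covariates, only columns 0, 5 and 12 matter.
random.seed(0)
n, p = 60, 40
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2 * r[0] - 3 * r[5] + 1.5 * r[12] + random.gauss(0, 0.5) for r in X]

support = lasso_support(X, y, lam=0.4)            # cheap screening over all 40 columns
best_bic, best_cols = best_subset(X, y, support)  # exhaustive search over the survivors
print("LASSO kept:", support)
print("best subset by BIC:", best_cols)
```

An exhaustive search over all 40 covariates would require fitting roughly 2^40 models, whereas the search over the screened set only needs 2^k - 1 fits for the k columns that survive the penalty. This is the trade-off the combined approach relies on.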
In the first 3 minutes, our Lightning Talk will discuss the new capabilities of GlobalSearchRegression.jl. We will focus on the main advantages of merging ML and EC algorithms for feature selection when the number of potential covariates is relatively large: ML provides efficiency and sub-sample uncertainty assessment, while EC guarantees in-sample and out-of-sample optimality with covariate uncertainty assessment.
Then, we will show different benchmarks for the new GlobalSearchRegression.jl package against R and Stata counterparts, as well as against its own original version. Our updated ML-EC algorithm written in Julia is up to 100 times faster than similar R or Stata programs and allows working with hundreds of potential covariates (while the upper limit for the original GlobalSearchRegression.jl version was 28).
Finally, we will use the last 4 minutes for a live hands-on example to show the Graphical User Interface, run the ML-EC algorithm on Fat Data, and analyze the main results using the new LaTeX-PDF output capabilities.

Co-authors – Pablo Gluzmann