JuliaCon 2023

HighDimMixedModels.jl
07-26, 09:30–10:00 (US/Eastern), Online talks and posters

I present on a package under development, HighDimMixedModels.jl, for fitting high dimensional mixed effect regression models. The motivation comes from analysis of microbial datasets, but the model is well-suited to many settings across bioinformatics. In my poster, I explain the usage of the package as well as the underlying statistical theory.


I present a package currently under development at http://www.github.com/solislemuslab/HighDimMixedModels.jl for fitting high dimensional mixed effect regression models. The original motivation for this package, and for the mixed-effect sparse learning models it will support, comes from the field of microbiology. Microbial communities are among the driving forces of biogeochemical processes. Standard approaches to studying the connection between microbial communities and these biogeochemical phenomena represent the microbial compositions as abundance matrices, which then serve as the design matrix in a regression or machine learning analysis.

These regressions present two major challenges. First, samples are often correlated because they are grouped in space (in soil studies, for example, samples may come from one of several different locations) or in time. If this inter-sample correlation structure is not properly accounted for, it leads to a loss of statistical power as well as incorrect inferences. Second, the number of microbial taxa present across the samples included in the analysis, which equals the number of regression coefficients to be estimated, is quite high.
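To make the first challenge concrete, here is a minimal sketch of the correlation that grouping induces. It assumes the simplest possible case, a random-intercept model in which every sample from the same location shares one random effect; the function names, site labels, and variance values are hypothetical and chosen only for illustration.

```python
# Sketch under an assumed random-intercept model (not taken from the package):
#   y_ij = x_ij' beta + b_i + eps_ij,
# where b_i ~ N(0, tau2) is shared by all samples in group i
# and eps_ij ~ N(0, sigma2) is independent noise.

def marginal_covariance(groups, tau2, sigma2):
    """Return Cov(y) as a nested list; groups[k] is the group label of sample k."""
    n = len(groups)
    cov = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if groups[a] == groups[b]:
                cov[a][b] += tau2    # shared random intercept
            if a == b:
                cov[a][b] += sigma2  # independent noise on the diagonal
    return cov


def intraclass_correlation(tau2, sigma2):
    """Correlation between two distinct samples from the same group."""
    return tau2 / (tau2 + sigma2)


# Hypothetical design: five soil samples from two sites.
groups = ["site_A", "site_A", "site_B", "site_B", "site_B"]
cov = marginal_covariance(groups, tau2=2.0, sigma2=1.0)
icc = intraclass_correlation(tau2=2.0, sigma2=1.0)
```

Samples from different sites have zero covariance, while samples from the same site are correlated. A regression that treats all five samples as independent ignores this block structure.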

Existing Julia packages address each of these problems individually. High dimensional estimation with well-studied guarantees can be done using the Lasso, which is implemented efficiently in Lasso.jl, while MixedModels.jl is a popular package for fitting mixed effect models. The package under development handles models that exhibit both features at once: they incorporate random effects and also have a high dimensional vector of fixed effects. In particular, I am implementing the estimator proposed by Schelldorfer et al. (2011) [1], which is computed with a coordinate gradient descent algorithm. The package, HighDimMixedModels.jl, is a Julia translation of the R package lmmlasso, which was built by Schelldorfer and subsequently removed from CRAN. My hope is that by harnessing the speed of Julia, my implementation of Schelldorfer's model and algorithm will allow researchers in biology to fit high dimensional, mixed-effect regression models more efficiently. The software will be accompanied by step-by-step tutorials and examples similar to those found in the MixedModels.jl documentation.
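For readers who want the estimator in symbols, the model and objective can be sketched as follows. The notation (N groups, group-level design matrices X_i and Z_i, a random effect covariance Ψ_θ parameterized by θ) is a summary of the setup in [1] rather than something defined in this abstract, so it should be checked against that paper.

```latex
\begin{aligned}
y_i &= X_i \beta + Z_i b_i + \varepsilon_i, \qquad i = 1, \dots, N, \\
b_i &\sim \mathcal{N}_q(0, \Psi_\theta), \qquad
\varepsilon_i \sim \mathcal{N}_{n_i}(0, \sigma^2 I_{n_i}), \\
V_i &= Z_i \Psi_\theta Z_i^\top + \sigma^2 I_{n_i}, \qquad
V = \operatorname{diag}(V_1, \dots, V_N), \\
Q_\lambda(\beta, \theta, \sigma^2) &=
\tfrac{1}{2} \log \lvert V \rvert
+ \tfrac{1}{2} (y - X\beta)^\top V^{-1} (y - X\beta)
+ \lambda \sum_{k=1}^{p} \lvert \beta_k \rvert .
\end{aligned}
```

The first two terms are the negative Gaussian log-likelihood of the marginal model, up to a constant, and the last term is the ℓ1 penalty on the fixed effects, whose number p may exceed the total number of samples. The estimator minimizes Q_λ jointly over β, θ, and σ².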

While the intersection of hierarchical or grouped sampling structure with a high dimensional feature space might at first glance seem like a niche, highly specialized setting, it may actually be common in biology in the age of -omics data. Although the impetus for this package is regression analysis of microbial data, the hope is that it will also be useful to researchers working with other types of data, including metagenomic data, metatranscriptomic data, or continuous measurements such as metabolite levels, methylation, or gene expression. As an example of this last category, the model was successfully applied in [1] to a gene expression data matrix to identify which genes are most relevant for the production of riboflavin (vitamin B2) in the bacterium Bacillus subtilis.

In my poster, I will explain how to use my package, its high-level API, and its implementation details, as well as the underlying statistical theory. I hope that introducing the package at JuliaCon will make it useful to the many bioinformatics researchers who attend the conference.

References:

[1] Schelldorfer, J., Bühlmann, P., & van de Geer, S. (2011). Estimation for High-Dimensional Linear Mixed-Effects Models Using ℓ1-Penalization. Scandinavian Journal of Statistics, 38(2), 197–214. http://www.jstor.org/stable/23015490

Evan Gorstein is a 3rd year PhD student in the Department of Statistics at the University of Wisconsin-Madison.