MendelIHT.jl: How to fit Generalized Linear Models for High Dimensional Genetics (GWAS) Data
2019-07-25, 17:15–17:25, Elm A

GWAS data are extremely high dimensional, large (>100 GB), and dense, and they typically contain rare and correlated predictors. In this talk we discuss their unique data structures, how to represent them efficiently in Julia, how MendelIHT.jl works with Distributions.jl and GLM.jl to fit generalized linear models to GWAS data, and the role of parallel computing.
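
To make the storage point concrete: a genotype at a biallelic SNP takes one of only four states (0, 1, or 2 copies of an allele, or missing), so it fits in 2 bits and four genotypes fit in one byte. The sketch below is illustrative only. The function names and the bit codes are hypothetical and are not taken from MendelIHT.jl or any other OpenMendel package.

```julia
# Hypothetical 2-bit codes: 0, 1, 2 = allele count, 3 = missing.

# Pack a vector of 2-bit genotype codes, four per byte.
function pack_genotypes(g::Vector{UInt8})
    packed = zeros(UInt8, cld(length(g), 4))
    for (i, code) in enumerate(g)
        byte = ((i - 1) >> 2) + 1        # which byte holds genotype i
        shift = 2 * ((i - 1) & 3)        # bit offset inside that byte
        packed[byte] |= (code & 0x03) << shift
    end
    return packed
end

# Read genotype i back out of the packed representation.
function get_genotype(packed::Vector{UInt8}, i::Integer)
    byte = ((i - 1) >> 2) + 1
    shift = 2 * ((i - 1) & 3)
    return (packed[byte] >> shift) & 0x03
end

g = UInt8[0, 1, 2, 3, 2, 1]
packed = pack_genotypes(g)               # 6 genotypes stored in 2 bytes
unpacked = [get_genotype(packed, i) for i in 1:length(g)]
```

Compared with storing each genotype as a Float64 (8 bytes), a layout like this is 32 times smaller, which is what makes files of this size workable in memory.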


Background: Marginal regression is widely used by the genomics community to identify variants associated with complex traits. Ideally one would consider all covariates in tandem, but existing multivariate methods are poorly suited to the common issues of a modern genome-wide association study (GWAS). Here we fill the gap with a new multivariate algorithm: iterative hard thresholding (IHT).

Method: We introduce a novel coefficient estimation scheme based on maximum likelihood, extending the IHT algorithm to perform multivariate model estimation for any exponential family. We further discuss and implement doubly sparse and prior-knowledge-aided variants of IHT to tackle specific problems in genetics, such as linkage disequilibrium.
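
In outline, IHT alternates a gradient step on the log-likelihood with a projection that keeps only the k largest-magnitude coefficients. The following is a minimal sketch for the logistic model, assuming a fixed step size rather than an optimal step length. `project_k`, `iht_logistic`, and their parameters are hypothetical names, not the MendelIHT.jl API.

```julia
# Keep the k largest-magnitude entries of b and zero out the rest.
function project_k(b::AbstractVector, k::Integer)
    out = zero(b)
    keep = partialsortperm(abs.(b), 1:k, rev=true)
    out[keep] = b[keep]
    return out
end

# Mean function of the logistic model (inverse of the logit link).
logistic(eta) = 1 / (1 + exp(-eta))

# IHT for logistic regression. The fixed `step` and `iters` defaults are
# assumptions made for this sketch.
function iht_logistic(X::AbstractMatrix, y::AbstractVector, k::Integer;
                      step::Real = 0.01, iters::Integer = 200)
    beta = zeros(size(X, 2))
    for _ in 1:iters
        mu = logistic.(X * beta)             # fitted means
        score = X' * (y .- mu)               # gradient of the log-likelihood
        beta = project_k(beta .+ step .* score, k)
    end
    return beta
end

project_k([0.5, -3.0, 2.0, 0.1], 2)          # returns [0.0, -3.0, 2.0, 0.0]

# Toy data: 4 samples, 3 predictors, at most 2 nonzero coefficients allowed.
X = [1.0 0.0 0.2; 1.0 1.0 -0.1; 1.0 2.0 0.3; 1.0 3.0 0.0]
y = [0.0, 0.0, 1.0, 1.0]
beta = iht_logistic(X, y, 2)
```

Other exponential family models change only the mean function and, in a full implementation, the step length. The projection step stays the same.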

Results: We show how to apply IHT to any generalized linear model, and explicitly derive the updating algorithm and optimal step length for logistic and Poisson models. We provide an efficient implementation of IHT in Julia for analyzing GWAS data, as a module under OpenMendel. We test our algorithm on real and simulated data to demonstrate model quality, algorithm robustness, and scalability. We then investigate when and how doubly sparse (group and within-group) projections and knowledge-aided projections may help in discovering rare genetic variants with small effect sizes. Our implementation has built-in parallelism, operates directly on raw genotype files, and is completely open source.
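
One plausible reading of a doubly sparse projection is: within every group keep the k largest-magnitude coefficients, then keep only the J groups whose surviving coefficients have the largest squared norm. The sketch below assumes that reading. The function name, its arguments, and the group-ranking rule are hypothetical and may differ from what MendelIHT.jl does.

```julia
# Hypothetical doubly sparse projection: at most J groups survive, and at
# most k entries survive inside each of those groups.
function project_group_sparse(b::AbstractVector,
                              groups::AbstractVector{<:Integer},
                              J::Integer, k::Integer)
    out = zero(b)
    norms = Dict{Int,Float64}()
    for g in unique(groups)
        idx = findall(==(g), groups)         # positions belonging to group g
        nkeep = min(k, length(idx))
        top = idx[partialsortperm(abs.(b[idx]), 1:nkeep, rev=true)]
        out[top] = b[top]                    # within-group sparsity
        norms[g] = sum(abs2, b[top])         # energy of the survivors
    end
    # Rank groups by the energy of their surviving entries, then zero out
    # every group beyond the best J (group-level sparsity).
    ranked = sort(collect(keys(norms)), by = g -> norms[g], rev = true)
    for g in ranked[J+1:end]
        out[groups .== g] .= 0
    end
    return out
end

b = [3.0, -1.0, 0.5, 0.2, -0.1, 4.0, 2.0]
groups = [1, 1, 1, 2, 2, 3, 3]
p1 = project_group_sparse(b, groups, 2, 1)   # returns [3.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0]
p2 = project_group_sparse(b, groups, 1, 2)   # returns [0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 2.0]
```

In a genetics setting the groups could be genes or blocks of variants in linkage disequilibrium, so that correlated predictors compete within a group instead of crowding out signals elsewhere.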

Significance: For geneticists, our method offers enhanced multivariate model selection for big-data GWAS. For theorists, we demonstrate how to use IHT to find GLM coefficients, derive two variants of the thresholding operator, and show when each is expected to perform better.