Generalized linear models (GLMs) are interpretable, relatively quick to train, and specifying them helps the modeler understand the main effects in the data. This makes them a popular choice today to complement other machine-learning approaches. `glum` was conceived with the aim of offering the community an efficient, feature-rich, and Python-first GLM library with a scikit-learn-style API. More recently, we have been striving to keep up with the PyData community's ongoing push for dataframe-agnosticism. While `glum` was originally heavily based on `pandas`, with the help of `narwhals` we are close to being able to fit models on any dataset that the latter supports. This talk presents our experiences with achieving this goal.
Arguably, `glum`'s standout feature is its ability to efficiently handle datasets consisting of a mix of dense, sparse, and categorical features. To facilitate this, it relies on our (similarly open-source) `tabmat` library, which provides classes and useful methods for mixed-sparsity data. `glum` fits models by first converting input data to `tabmat` matrices, and then using those matrices to do the necessary computations.
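To make this concrete, here is a minimal sketch of that conversion step, assuming recent versions of `tabmat` and `pandas`; `tabmat.from_pandas` is a real entry point, but the thresholds it uses to pick a representation per column are version-dependent:

```python
import numpy as np
import pandas as pd
import tabmat

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "age": rng.normal(size=1_000),                       # dense numeric
        "claims": np.r_[np.zeros(950), rng.poisson(2.0, 50)],  # mostly zeros
        "region": pd.Categorical(rng.choice(list("abc"), size=1_000)),
    }
)

# tabmat chooses a dense, sparse, or categorical representation for each
# column and combines them into a single mixed-sparsity matrix.
X = tabmat.from_pandas(df)
print(type(X))
```

The matrix and sandwich products that the fitting loop needs then dispatch to routines specialized for each block, which is where much of the performance comes from.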
Therefore, dataframe-agnosticism in our case mostly boils down to efficiently converting different dataframes to `tabmat` matrices (which themselves store data in `numpy` arrays and sparse `scipy` matrices). Most of this is rather smooth and straightforward, as `narwhals` provides a convenient compatibility layer for a wide range of dataframe functionality. However, we have encountered a couple of pain points that might be of interest to other package maintainers and the PyData community. In particular:
- We heavily rely on manipulating the category order and encoding of categorical variables, for which there is somewhat limited support, because dataframe libraries handle categorical columns rather differently (see the sketch after this list).
- Most dataframe libraries do not support sparse columns, while it is important for us to be able to accept sparse inputs.
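As a hedged illustration of both points (exact behavior depends on library versions): `pandas` encodes the category set and its order in the dtype, `polars` distinguishes between `Categorical` (order determined by appearance or the string cache) and `Enum` (fixed, user-supplied categories), and `pandas` is one of the few libraries with a sparse extension dtype:

```python
import pandas as pd
import polars as pl

# pandas stores a user-defined category order in the dtype itself.
s_pd = pd.Series(pd.Categorical(["b", "a", "b"], categories=["b", "a"]))
print(list(s_pd.cat.categories))   # ['b', 'a']

# polars' Categorical does not take a user-defined order at construction;
# a fixed category set and order requires the Enum dtype instead.
s_pl = pl.Series(["b", "a", "b"], dtype=pl.Enum(["b", "a"]))
print(s_pl.dtype)                  # Enum(categories=['b', 'a'])

# pandas has a sparse extension dtype...
df_sparse = pd.DataFrame({"x": pd.arrays.SparseArray([0.0, 0.0, 1.5, 0.0])})
print(df_sparse["x"].dtype)        # Sparse[float64, 0.0]

# ...which converts losslessly to a scipy COO matrix for tabmat, whereas
# polars (and the Arrow format) currently have no sparse column type.
coo = df_sparse.sparse.to_coo()
```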
In this talk I demonstrate how we used `narwhals` to easily accept multiple types of dataframes. I will go into detail about categorical and sparse columns and present the challenges we encountered with them. I will also examine the benefits and challenges of supporting sparse columns in dataframe libraries and the Arrow standard. These points are meant to facilitate discussion among the participants and in the PyData community.
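As a taste of the approach, the following is a minimal sketch (not `glum`'s actual implementation) of dataframe-agnostic column extraction via `narwhals`; the dispatch on dtype stands in for the real logic that routes columns into dense, sparse, or categorical `tabmat` blocks:

```python
import narwhals as nw

def extract_columns(native_df):
    """Accept any eager dataframe narwhals supports (pandas, polars, ...)."""
    df = nw.from_native(native_df, eager_only=True)
    out = {}
    for name, dtype in df.schema.items():
        series = df.get_column(name)
        if dtype == nw.Categorical:
            # Category order/encoding handling lives here; this is
            # where the backends differ the most.
            cats = series.cat.get_categories().to_numpy()
            out[name] = (series.to_numpy(), cats)
        else:
            out[name] = series.to_numpy()
    return out

# Works unchanged for e.g. pandas and polars inputs:
# extract_columns(pd.DataFrame(...)); extract_columns(pl.DataFrame(...))
```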
At the end of the talk I will also briefly mention potential future plans for `glum` and `tabmat`, including the possibility of doing computations directly on Arrow objects without converting them to `numpy` and `scipy` arrays.
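To sketch why this direction is plausible (a hypothetical illustration, not implemented functionality): numeric Arrow buffers can often be viewed as `numpy` arrays without a copy, and Arrow's dictionary-encoded arrays already expose the codes/categories split that a categorical matrix type needs:

```python
import pyarrow as pa

# Numeric Arrow arrays without nulls can be viewed zero-copy as numpy.
dense = pa.array([1.0, 2.0, 3.0])
np_view = dense.to_numpy(zero_copy_only=True)

# Dictionary-encoded arrays already separate codes from categories.
cat = pa.array(["a", "b", "a", "c"]).dictionary_encode()
print(cat.indices.to_numpy())      # [0 1 0 2]
print(cat.dictionary.to_pylist())  # ['a', 'b', 'c']
```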
Outline
- A short intro to `glum`, its backend library `tabmat`, and the main ideas that make them performant.
- Making `glum` dataframe-agnostic:
  - Showcase how `narwhals` simplifies handling a wide variety of dataframes.
  - Discuss handling categorical (and enum/dictionary) columns.
  - Talk about representing sparse columns in dataframes.
- Concluding remarks and potential future plans.
Target audience
- Basic understanding of the scientific Python ecosystem (with a focus on dataframe libraries) is recommended.
- While some familiarity with linear models might be useful to get the most out of this talk, it is by no means required.
Main takeaways
- How `glum` efficiently handles mixed-sparsity data.
- How `narwhals` helps achieve dataframe-agnosticism with little effort.
- Differences between categorical types in various packages and the Apache Arrow specification.
- How support for sparse columns could be incorporated into dataframe libraries and the Arrow Columnar Format.
Expected audience expertise: Intermediate
Expected audience expertise (Python): Intermediate
Martin is a data scientist/engineer at QuantCo. He mainly works on developing the software packages that QuantCo uses for insurance risk modeling and pricing. This includes QuantCo's open-source generalized linear modeling package, `glum`.
He has a background in economics, and has previously worked at the Central Bank of Hungary as an applied researcher. He also taught a number of 'Programming for Economists' courses for college and PhD students.