PyConDE & PyData Berlin 2024

Missing Data, Bayesian Imputation and People Analytics with PyMC
2024-04-23, B09

We demonstrate a range of approaches to missing-data imputation in employee engagement survey data. Contrasting frequentist-style full-information maximum likelihood approaches with more direct Bayesian imputation and chained-equation methods, we highlight how different assumptions about the missing data license different inferences about the imputed values and, ultimately, the plausible causal narratives that can be expressed in PyMC. In particular, we use the hierarchical nature of employee engagement data to motivate a hierarchical approach to justifying the missing-at-random (MAR) assumption for imputation schemes in People Analytics.


There is no "agnostic statistics" when approaching the question of missing data. Theory quickly breaks against reality in the context of people analytics. Every imputation scheme needs to justify its assumption that the missing data are "strongly ignorable" or "missing at random". Stating and defending these assumptions is easier and cleaner in a Bayesian setting than in frequentist alternatives, and this transparency is important when dealing with HR data. We will demonstrate both full-information maximum likelihood (FIML) and Bayesian imputation by chained equations for imputing missing data in employee engagement surveys.
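
To make the missing-at-random assumption concrete, here is a minimal NumPy sketch. It is not the FIML or chained-equation method from the talk: the team means, response rates and the single team-mean imputation step are all hypothetical choices made for illustration. The response probability depends only on the (observed) team, so the data are MAR given team, yet the naive complete-case average is still biased.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical organisation: 4 teams, 5000 employees each.
n_teams, n_per_team = 4, 5000
team = np.repeat(np.arange(n_teams), n_per_team)

# Assumed team-level mean engagement (1-5 scale) and response rates.
# Less engaged teams respond less often: missingness depends on team only.
team_mean = np.array([4.2, 3.8, 3.0, 2.4])
response_rate = np.array([0.9, 0.8, 0.5, 0.3])

score = rng.normal(team_mean[team], 0.5)
responded = rng.random(team.size) < response_rate[team]

# The quantity we would like to know, and the naive estimate of it.
true_mean = score.mean()
complete_case_mean = score[responded].mean()

# Impute each non-respondent with the observed mean of their own team.
team_obs_mean = np.array(
    [score[responded & (team == t)].mean() for t in range(n_teams)]
)
imputed = np.where(responded, score, team_obs_mean[team])
imputed_mean = imputed.mean()

print(f"true mean:          {true_mean:.3f}")
print(f"complete-case mean: {complete_case_mean:.3f}")
print(f"team-imputed mean:  {imputed_mean:.3f}")
```

Conditioning on team removes most of the bias here only because the simulation was built so that team fully explains the non-response. Single mean imputation also understates uncertainty about the imputed values, which is one reason to prefer a full Bayesian treatment.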

We will use the probabilistic programming language PyMC to articulate the structures and conditional probabilities around missing data in hierarchical organisations. Non-response bias in engagement survey data often corrupts the overall picture of organisational health, and modelling that bias helps uncover patterns and trends in the missing-ness. These insights can be used diagnostically to locate the source of problems within the organisation, but we need to be willing to commit to the assumptions that license genuine causal inference. In this way we present the problem of missing data as a gateway to an organisational focus on causal inference problems. Somewhat ironically, the lack of data can actually make the problems of causal inference more concrete for business stakeholders.


Expected audience expertise: Domain:

Novice

Expected audience expertise: Python:

Novice

Public link to supporting material, e.g. videos, Github, etc.:

https://github.com/pymc-devs/pymc-examples/pull/500

Abstract as a tweet (X) or toot (Mastodon):

Hierarchical structures are everywhere in business! Ever wondered how trickle-down management missteps drive non-response bias in Employee Engagement? Model the hierarchy, model the missing-ness with PyMC!

I'm a data scientist from Dublin, working at Personio on a range of revenue- or customer-focused areas. Previously I worked with CarTrawler on pricing and insurance risk modelling, and with Marsh and McLennan in areas of re-insurance and catastrophic risk. Before this I worked at Paddy Power Betfair on models of risk indicators for gambling as part of a responsible gambling initiative. I'm broadly interested in problems of risk and confounding.