Dreadful Frailties in Propensity Score Matching and How to Fix Them
2024-09-26, Gaston Berger

In their seminal paper "Why propensity scores should not be used for matching," King and Nielsen (2019) highlighted the shortcomings of Propensity Score Matching (PSM). Despite these concerns, PSM remains a prevalent tool for mitigating selection bias in numerous retrospective medical studies each year and continues to be endorsed by health authorities. Guidelines for mitigating these issues have been proposed, but many researchers encounter difficulties both in adhering to these guidelines and in thoroughly documenting the entire process.

In this presentation, I show the variability in outcomes that persists even among matchings satisfying the commonly accepted validation condition of a Standardized Mean Difference (SMD) below 10%. This variability can significantly impact treatment comparisons, potentially leading to misleading conclusions. To address this issue, I introduce A2A, a novel metric computed on a task specifically designed for the problem at hand. By combining A2A with SMD, our approach reduces the variability of predicted Average Treatment Effects (ATE) by up to 90% across validated matching techniques.
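To make the SMD validation condition concrete, here is a minimal sketch of the standard check: for each covariate, the absolute difference in group means is divided by the pooled standard deviation, and the matching is deemed balanced when every covariate falls below 0.10. The data and variable names below are illustrative, not from the talk.

```python
import numpy as np

def smd(x_treated, x_control):
    """Standardized mean difference between two groups for one covariate."""
    pooled_sd = np.sqrt(
        (np.var(x_treated, ddof=1) + np.var(x_control, ddof=1)) / 2
    )
    return abs(np.mean(x_treated) - np.mean(x_control)) / pooled_sd

rng = np.random.default_rng(0)
treated = rng.normal(0.1, 1.0, 200)  # hypothetical matched treated covariate
control = rng.normal(0.0, 1.0, 200)  # hypothetical matched control covariate

# Common validation rule: SMD below 10% for every covariate.
balanced = smd(treated, control) < 0.10
```

The talk's point is that many different matchings can all pass this threshold yet yield substantially different ATE estimates, which is why an additional criterion such as A2A is needed.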

These findings collectively enhance the reliability of PSM outcomes and lay the groundwork for a comprehensive automated bias correction procedure. Additionally, to facilitate seamless adoption across programming languages, I have integrated these methods into "popmatch," a Python package that not only incorporates these techniques but also offers a convenient Python interface for R's MatchIt methods.


This presentation aims to raise awareness of the pressing challenge of evaluating bias correction methods, which suffer from a lack of ground truth for validation. While validation typically relies on expert opinion, we propose a novel metric and lay the groundwork for a comprehensive automated pipeline. This pipeline aims to offer consistent results across experiments, even if not optimal, thereby addressing the current variability in outcomes.

Primarily geared towards practitioners in the medical field who use propensity score matching, this talk offers practical examples and steers clear of excessive theoretical discussion. Nevertheless, the core message applies to various bias correction methods across different domains. The presentation identifies the problem, provides a rudimentary solution ready for refinement, and advocates for public engagement in the project.

The talk is structured as follows:

  • Identifying the Problem: Using real and synthetic datasets, we demonstrate how different population matching techniques can produce varying results, highlighting the need for a robust evaluation method.

  • Introduction of Ground Truth in Bias Correction: We demonstrate how existing data can be leveraged to construct artificial problems with known solutions, offering a benchmark for evaluation.

  • Crafting a Metric: We present A2A, a novel metric designed to evaluate bias correction methods based on the repeated creation of synthetic problems. We employ this metric to gain a deeper understanding of the phenomenon.

  • Future Perspectives: We discuss the limitations of the proposed metric and avenues for refinement. Additionally, we contextualize this endeavor within the broader framework of building a fully automated bias correction engine, highlighting remaining tasks to encourage contributions.

Lead data scientist at Implicity.