PyData London 2026

Beyond ML Model Calibration: Hands-On Multicalibration with MCGrad
2026-06-05 , Hardwick Hub

Your model is well-calibrated on average, but is it calibrated for every subgroup of your users? In this hands-on tutorial you will learn what multicalibration is, why standard calibration methods leave systematic errors hidden in subpopulations, why this matters for ML models in production, and how to fix it in a few lines of code using MCGrad, an open-source Python library that has been battle-tested on hundreds of production models at a large tech company. Attendees will leave with a working notebook they can immediately apply to their own projects.


A globally well-calibrated model can still be systematically overconfident for one subgroup and underconfident for another. These errors cancel out in aggregate, so the model passes standard checks while silently degrading decisions for specific populations. Multicalibration fixes this by ensuring predictions are calibrated across all subgroups simultaneously, while improving other notions of model performance.
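To make the cancellation concrete, here is a minimal sketch in plain Python. The two subgroups, their sizes, and the predicted probability are made-up numbers chosen for illustration, and `make_rows` and `calibration_gap` are hypothetical helper names, not part of any library.

```python
# Toy data: (subgroup, predicted probability, observed outcome).
# All numbers are invented for illustration only.
def make_rows(group, pred, n_pos, n_neg):
    return [(group, pred, 1)] * n_pos + [(group, pred, 0)] * n_neg

# Group A: model predicts 0.5, but only 30% of outcomes are positive.
# Group B: model predicts 0.5, but 70% of outcomes are positive.
rows = make_rows("A", 0.5, 30, 70) + make_rows("B", 0.5, 70, 30)

def calibration_gap(rows):
    """Mean prediction minus observed positive rate (0 = calibrated on average)."""
    mean_pred = sum(p for _, p, _ in rows) / len(rows)
    pos_rate = sum(y for _, _, y in rows) / len(rows)
    return mean_pred - pos_rate

overall_gap = calibration_gap(rows)
gap_by_group = {
    g: calibration_gap([r for r in rows if r[0] == g]) for g in ("A", "B")
}

print(f"overall gap: {overall_gap:+.2f}")
for g, gap in gap_by_group.items():
    print(f"group {g} gap: {gap:+.2f}")
```

The overall gap is zero because the over-prediction in group A and the under-prediction in group B offset each other exactly, which is the failure mode a global calibration check cannot see.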

This tutorial introduces multicalibration from scratch using MCGrad, an open-source library (pip install mcgrad) that has been deployed on hundreds of production ML models at a major tech company; its methodology was recently accepted at KDD 2026. Attendees train a classifier on a public dataset, discover hidden subgroup miscalibration, then fix it with MCGrad in a few lines of code, all inside a ready-to-run Colab notebook. We also cover hyperparameter tuning, safety mechanisms, and when not to apply multicalibration.

OUTLINE:
- Welcome & Setup (5 min)
Goals, format, open Colab notebook, pip install mcgrad.
- The Calibration Gap (15 min)
What is calibration, and why should ML practitioners care about it? Train a logistic regression on the dataset. Apply isotonic regression -- global calibration looks perfect. Reveal: the model is still badly miscalibrated for specific subgroups.
- From Calibration to Multicalibration (15 min)
Define multicalibration and the MCE metric. Why practitioners need it: you rarely know which subgroups matter in advance. Deployment lessons from a major tech company (hundreds of production models).
- MCGrad in Action -- Hands-On (30 min)
Walk through the MCGrad API (fit/predict). Fit MCGrad on the dataset, inspect the learning curve, compare base model vs. isotonic regression vs. MCGrad. Visualise segment-level error reduction. Mini-exercise: change segment features, observe impact on MCE.
- Advanced Features & Production Tips (15 min)
Hyperparameter tuning, safety mechanisms (no-op failsafe), regression multicalibration, model serialization, when not to use multicalibration.
- Wrap-Up & Q&A (10 min)
Recap the three-step workflow (measure MCE, fit MCGrad, verify). Pointers to docs and tutorials. Open Q&A.
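The "Calibration Gap" step in the outline above can be sketched roughly as follows with scikit-learn. This is not the tutorial notebook: the data is synthetic (the tutorial's public dataset is not named here), the subgroup variable and all coefficients are assumptions chosen for illustration, and MCGrad itself is deliberately not called because its exact API is not shown in this abstract.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): the outcome depends on a feature x and on a
# binary subgroup, but the base model only ever sees x.
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
group = rng.integers(0, 2, size=n)
logit = x + 1.5 * (2 * group - 1)
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
X = x.reshape(-1, 1)

# Train / calibration / test split.
train, calib, test = np.split(rng.permutation(n), [n // 2, 3 * n // 4])

# Base model, then global post-hoc calibration with isotonic regression.
base = LogisticRegression().fit(X[train], y[train])
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(base.predict_proba(X[calib])[:, 1], y[calib])
p_test = iso.predict(base.predict_proba(X[test])[:, 1])

def gap(mask):
    """Mean calibrated prediction minus observed positive rate on the test set."""
    return p_test[mask].mean() - y[test][mask].mean()

overall_gap = gap(np.ones(len(test), dtype=bool))
group_gaps = {g: gap(group[test] == g) for g in (0, 1)}

print(f"overall gap: {overall_gap:+.3f}")
for g, value in group_gaps.items():
    print(f"subgroup {g} gap: {value:+.3f}")
```

In this setup isotonic regression can only remap the model's score, so it cannot correct errors that depend on a variable the score ignores; the overall gap stays small while each subgroup's gap stays large and of opposite sign.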

Attendees leave with a working notebook, a new metric, multicalibration error (MCE), for auditing their own models, and a pip-installable tool to act on the results.

Niek Tax is a Staff Research Scientist and Tech Lead at Meta's Central Applied Science team in London. He focuses on longer-term, foundational work that addresses new opportunities and challenges across Meta, bridging the gap between academic rigour and product teams. Niek has extensive experience overseeing the end-to-end lifecycle of production-grade ML systems, from research to global deployment. His expertise is in uncertainty quantification, including active learning and probability calibration, and he has published articles at NeurIPS and KDD on those topics.

Before joining Meta, Niek worked as an ML engineer at Booking.com and in applied R&D at Philips Research. He holds a PhD in Computer Science from Eindhoven University of Technology, and has authored 35+ peer-reviewed publications with over 2,500 citations.