PyData Boston 2025

Yunxin Gao

Yunxin holds a Bachelor’s degree in Applied Statistics from the University of Wisconsin–Madison and a Master’s degree in Applied Statistics from New York University, with a focus on data science and big data. Since completing graduate school, Yunxin has spent 2.5 years working in model risk in the finance industry, specializing in evaluating, validating, and interpreting complex quantitative models. Their experience spans statistical modeling, machine learning, and model risk management, with a strong emphasis on translating analytical insights into actionable business decisions.


Session

12-10
09:45
40min
Rethinking Feature Importance: Evaluating SHAP and TreeSHAP for Tree-Based Machine Learning Models
Yunxin Gao

Tree-based machine learning models such as XGBoost, LightGBM, and CatBoost are widely used, but understanding their predictions remains challenging. SHAP (SHapley Additive exPlanations) provides feature attributions based on Shapley values, yet the assumptions and properties it relies on (feature independence, additivity, and consistency) often fail to hold in practice, potentially producing misleading explanations.
This talk critically examines SHAP’s limitations in tree-based models and introduces TreeSHAP, its specialized implementation for decision trees. Rather than presenting it as perfect, we evaluate its effectiveness, highlighting where it succeeds and where explanations remain limited. Attendees will gain a practical, critical understanding of SHAP and TreeSHAP, and strategies for interpreting tree-based models responsibly.
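
As a rough illustration of the ideas named in this abstract, and not material from the talk itself, the sketch below computes exact Shapley values by brute force for a tiny hand-written decision tree. Everything in it (`toy_tree`, `value`, `shapley_values`, the four-row background set) is a hypothetical example. It assumes missing features are filled in by averaging over background rows, which is where the feature-independence assumption enters.

```python
from itertools import combinations
from math import factorial


def toy_tree(x):
    """Hypothetical depth-2 decision tree over two features (x0, x1)."""
    if x[0] > 0.5:
        return 10.0 if x[1] > 0.5 else 6.0
    return 2.0


def value(model, x, background, subset):
    """Mean prediction with features in `subset` fixed to x's values.

    The remaining features are taken from each background row in turn.
    Mixing rows this way treats features as independent of each other.
    """
    total = 0.0
    for row in background:
        mixed = [x[i] if i in subset else row[i] for i in range(len(x))]
        total += model(mixed)
    return total / len(background)


def shapley_values(model, x, background):
    """Exact Shapley values by enumerating every feature subset.

    Cost grows exponentially with the number of features, so this is
    only practical for toy problems.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                with_i = value(model, x, background, set(s) | {i})
                without_i = value(model, x, background, set(s))
                phi[i] += weight * (with_i - without_i)
    return phi


background = [[0, 0], [0, 1], [1, 0], [1, 1]]  # illustrative data only
x = [1, 1]
phi = shapley_values(toy_tree, x, background)
base = value(toy_tree, x, background, set())

# Additivity: base value plus attributions reproduces the prediction.
print(phi, base, base + sum(phi), toy_tree(x))
```

In this toy setup the attributions sum to the gap between the prediction for `x` and the average prediction over the background rows, which is the additivity property. The enumeration visits every subset of features, whereas TreeSHAP uses the tree structure to avoid that exponential cost.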

Target audience: Data scientists, ML engineers, and analysts familiar with tree-based models.
Background: Basic understanding of feature importance and model interpretability.

Thomas Paul