EuroSciPy 2026

The Illusion of Compliance: Auditing LLM-as-a-Judge Systems
2026-07-21, Room 1.19 (Ground Floor, Shannon)

LLM-as-a-Judge systems are increasingly deployed in high-stakes settings: screening job applicants, triaging medical cases, assessing credit risk, and flagging legal exposure. As the EU AI Act takes effect in August 2026 with penalties up to €35M for biased high-risk systems, organizations are investing heavily in fairness audits. But passing a bias check does not guarantee fairness. In a controlled hiring experiment on real resumes, we demonstrate how alignment, and potentially bias-mitigation techniques, can reduce aggregate disparities while redistributing harm across intersectional subgroups. Standard Python fairness pipelines rarely detect this shift.


Consider a hiring model that shows equal acceptance rates for men and women, and equal rates for white and non-white candidates. Every single-axis dashboard is green. Yet Black women are rejected at nearly twice the rate of any other group. Social scientists call this intersectionality: the recognition that discrimination operates non-additively. A Black woman's experience isn't racism + sexism; the intersection creates distinct disadvantages. The bias doesn't disappear; it moves.
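To make the scenario concrete, here is a minimal pandas sketch. The counts are synthetic and chosen purely for illustration (they are not data from the talk), the race attribute is simplified to two values, and the helper name `rejection_rate` is hypothetical. It shows how both single-axis views can report identical rates while one intersectional cell is rejected at twice the rate of the next-highest group.

```python
import pandas as pd

# Synthetic counts for illustration only; not data from the experiment.
cells = pd.DataFrame(
    {
        "race": ["Black", "Black", "white", "white"],
        "gender": ["woman", "man", "woman", "man"],
        "n": [100, 300, 300, 900],          # applicants per cell
        "rejected": [40, 30, 30, 180],      # rejections per cell
    }
)


def rejection_rate(frame, by):
    """Rejection rate per group: total rejected / total applicants."""
    totals = frame.groupby(by)[["rejected", "n"]].sum()
    return totals["rejected"] / totals["n"]


# Single-axis slices: what a typical dashboard would show.
by_gender = rejection_rate(cells, "gender")
by_race = rejection_rate(cells, "race")

# Intersectional slice: one rate per (race, gender) cell.
by_cell = rejection_rate(cells, ["race", "gender"])

print(by_gender, by_race, by_cell, sep="\n\n")
```

In this toy data the two single-axis slices agree exactly, so the disparity only becomes visible once the groupby uses both attributes together.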
We’ll walk through Python workflows that:

  • Move beyond single-attribute slicing to multi-dimensional group analysis

  • Implement additivity testing (to quantify non-additive discrimination)

  • Detect dimensional heterogeneity (when disparities by gender shrink while disparities by race grow)

  • Surface trade-offs introduced by alignment and tuning
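As a rough sketch of what two of these checks could look like, the snippet below computes an interaction contrast on the rate scale (one common way to test additivity, not necessarily the method used in the talk) and labels each dimension by whether its gap shrank or grew after an intervention. The function names, the reference group, and all numbers are hypothetical and for illustration only.

```python
def interaction_contrast(rates, ref=("white", "man"), focal=("Black", "woman")):
    """Observed focal rate minus the rate predicted by adding single-axis gaps.

    A value near zero is consistent with additivity; a large positive value
    means the focal group fares worse than the two single-axis gaps predict.
    """
    r_ref = rates[ref]
    r_race_only = rates[(focal[0], ref[1])]    # differs from ref on race only
    r_gender_only = rates[(ref[0], focal[1])]  # differs from ref on gender only
    expected_additive = r_ref + (r_race_only - r_ref) + (r_gender_only - r_ref)
    return rates[focal] - expected_additive


def label_dimensions(gaps_before, gaps_after):
    """Label each dimension by how its absolute disparity changed."""
    labels = {}
    for dim, before in gaps_before.items():
        after = gaps_after[dim]
        if abs(after) < abs(before):
            labels[dim] = "improved"
        elif abs(after) > abs(before):
            labels[dim] = "worsened"
        else:
            labels[dim] = "unchanged"
    return labels


# Hypothetical rejection rates per (race, gender) cell.
rates = {
    ("Black", "woman"): 0.40,
    ("Black", "man"): 0.10,
    ("white", "woman"): 0.10,
    ("white", "man"): 0.20,
}
excess = interaction_contrast(rates)

# Hypothetical single-axis gaps before and after a mitigation step.
labels = label_dimensions(
    gaps_before={"gender": 0.08, "race": 0.02},
    gaps_after={"gender": 0.01, "race": 0.06},
)
# Dimensional heterogeneity: dimensions moved in different directions.
heterogeneous = len(set(labels.values())) > 1

print(f"excess over additive expectation: {excess:.2f}")
print(labels, "heterogeneous:", heterogeneous)
```

A real audit would attach uncertainty estimates to these quantities (for example via bootstrapping), since small intersectional cells produce noisy rates.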

Although the empirical case centers on hiring, the evaluation framework generalizes to any high-stakes LLM-as-a-Judge deployment. Attendees will leave with a reproducible evaluation framework grounded in 50 years of social science research, practical tools for EU AI Act compliance, and a clearer understanding of what meaningful compliance requires in regulated environments.


Expected audience expertise: Domain: none
Expected audience expertise: Python: some
Your relationship with the presented work/project: Original author or co-author

Previous experience working as a data scientist on varied business propositions, ranging from detecting scientific fraud in publishing to supply chain optimization, customer attrition, upselling and cross-selling card products, web personalization, and customer-merchant affinity.
