Vasu Sharma
- Currently interested in AI Safety and Alignment Research | Writes at https://permutedsense.substack.com/
- Open source and privacy-preserving tools enthusiast
Previously worked as a data scientist on a range of business problems, including detecting scientific fraud in publishing, supply chain optimization, customer attrition, upselling and cross-selling card products, web personalization, and customer-merchant affinity.
she/her
Sessions
LLM-as-a-Judge systems are increasingly deployed in high-stakes settings - screening job applicants, triaging medical cases, assessing credit risk, and flagging legal exposure. As the EU AI Act takes effect in August 2026 with penalties up to €35M for biased high-risk systems, organizations are investing heavily in fairness audits. But passing a bias check does not guarantee fairness. In a controlled hiring experiment on real resumes, we demonstrate how alignment techniques, and potentially bias-mitigation techniques, can reduce aggregate disparities while redistributing harm across intersectional subgroups. Standard Python fairness pipelines rarely detect this shift.
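A minimal sketch of why an aggregate check can miss intersectional harm. The records, attribute names, and helper functions below are hypothetical, made up for illustration; they are not the data or pipeline from the hiring experiment. The sketch compares selection rates per single attribute with selection rates per attribute pair.

```python
from collections import defaultdict

# Hypothetical toy records: (gender, ethnicity, selected). Invented for illustration.
records = (
    [("F", "A", 1)] * 8 + [("F", "A", 0)] * 2 +
    [("F", "B", 1)] * 2 + [("F", "B", 0)] * 8 +
    [("M", "A", 1)] * 2 + [("M", "A", 0)] * 8 +
    [("M", "B", 1)] * 8 + [("M", "B", 0)] * 2
)

def selection_rates(rows, key):
    """Return the fraction of selected candidates for each group defined by `key`."""
    totals = defaultdict(int)
    selected = defaultdict(int)
    for row in rows:
        group = key(row)
        totals[group] += 1
        selected[group] += row[2]
    return {group: selected[group] / totals[group] for group in totals}

def parity_gap(rates):
    """Largest difference in selection rate between any two groups."""
    return max(rates.values()) - min(rates.values())

# Marginal views: one protected attribute at a time.
by_gender = selection_rates(records, key=lambda r: r[0])
by_ethnicity = selection_rates(records, key=lambda r: r[1])

# Intersectional view: both attributes together.
by_intersection = selection_rates(records, key=lambda r: (r[0], r[1]))

print("gender gap:        ", parity_gap(by_gender))
print("ethnicity gap:     ", parity_gap(by_ethnicity))
print("intersectional gap:", parity_gap(by_intersection))
```

In this toy data both single-attribute gaps are zero, while the gap across the four intersectional subgroups is large. An audit that only reports marginal parity would pass it.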
Large language models (LLMs) have become central to modern scientific computing, yet for most practitioners they remain opaque systems - input goes in, text comes out, and the internal mechanism is a mystery. Mechanistic interpretability (MI) is the emerging discipline of reverse-engineering what specific components of a neural network actually do.
Using Andrej Karpathy's microgpt - a fully self-contained, 200-line, dependency-free GPT implementation in pure Python - as our subject, we systematically dissect what a trained language model has learned. No PyTorch, no specialised ML frameworks: just the familiar tools applied to a genuinely novel problem.
The model is tiny by design: 4,192 parameters, a 27-token vocabulary (a–z + a special token), trained on 32,000 names in roughly one minute on a laptop. This makes it the ideal subject for interpretability work - every attention weight is inspectable, every embedding printable, every head ablatable. The scientific question driving the tutorial is: "What has this model actually learned about the structure of names?"
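As a rough illustration of what "ablatable" means, the sketch below zero-ablates one head in a toy multi-head attention step written in pure Python. It is not microgpt's code: the function names, the random vectors standing in for trained activations, and the sizes (16-dimensional vectors, 4 heads, 5 positions) are all assumptions chosen for illustration.

```python
import math
import random

random.seed(0)

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(q, keys, values):
    """Scaled dot-product attention for one query over a list of positions."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]
    return out, weights

def multi_head(q, keys, values, n_head, ablate=None):
    """Run each head on its slice; zero the output of head `ablate` if given."""
    head_dim = len(q) // n_head
    out = []
    for h in range(n_head):
        lo, hi = h * head_dim, (h + 1) * head_dim
        head_out, _ = attention_head(
            q[lo:hi], [k[lo:hi] for k in keys], [v[lo:hi] for v in values]
        )
        if h == ablate:
            head_out = [0.0] * head_dim  # zero ablation: remove this head's contribution
        out.extend(head_out)
    return out

# Illustrative sizes and random stand-in vectors (assumptions, not trained weights).
n_embd, n_head, seq_len = 16, 4, 5

def rand_vec():
    return [random.gauss(0, 1) for _ in range(n_embd)]

q = rand_vec()
keys = [rand_vec() for _ in range(seq_len)]
values = [rand_vec() for _ in range(seq_len)]

baseline = multi_head(q, keys, values, n_head)
ablated = multi_head(q, keys, values, n_head, ablate=2)

# Only the ablated head's slice of the output changes.
changed = [i for i, (a, b) in enumerate(zip(baseline, ablated)) if a != b]
print("output positions changed by ablating head 2:", changed)
```

In an actual interpretability study the comparison of interest would be the model's loss or predictions with and without the head, rather than the raw attention output shown here.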