EuroSciPy 2026

A Hands-On Introduction to Mechanistic Interpretability
2026-07-23 , Room 1.38 (Ground Floor, Turing)

Large language models (LLMs) have become central to modern scientific computing, yet for most practitioners they remain opaque systems - input goes in, text comes out, and the internal mechanism is a mystery. Mechanistic interpretability (MI) is the emerging discipline of reverse-engineering what specific components of a neural network actually do.
Using Andrej Karpathy's microgpt - a fully self-contained, 200-line, dependency-free GPT implementation in pure Python - as our subject, we systematically dissect what a trained language model has learned. No PyTorch, no specialised ML frameworks: just familiar scientific Python tools (PCA, heatmaps, scipy.stats) applied to a genuinely novel problem.

The model is tiny by design: 4,192 parameters, a 27-token vocabulary (a–z + a special token), trained on 32,000 names in roughly one minute on a laptop. This makes it the ideal subject for interpretability work - every attention weight is inspectable, every embedding printable, every head ablatable. The scientific question driving the tutorial is: "What has this model actually learned about the structure of names?"
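
As a rough sanity check on the 4,192 figure, the sketch below tallies weights for one configuration that reproduces it. The hyperparameters and layer names (n_embd, block_size, wte, wpe and so on) are assumptions made for illustration and are not stated in this abstract; the tally also assumes no bias terms and no learned normalisation parameters.

```python
# Assumed hyperparameters: only vocab_size = 27 comes from the abstract.
vocab_size = 27   # a-z plus one special token
n_embd = 16       # assumed embedding width
n_layer = 1       # assumed number of transformer blocks
block_size = 16   # assumed maximum context length

# Weight matrices only; biases and norm parameters are assumed absent.
counts = {
    "token embeddings (wte)": vocab_size * n_embd,
    "position embeddings (wpe)": block_size * n_embd,
    "output head (lm_head)": vocab_size * n_embd,
    "attention (wq, wk, wv, wo)": n_layer * 4 * n_embd * n_embd,
    "MLP (fc1, fc2)": n_layer * 2 * (4 * n_embd) * n_embd,
}

total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:28s} {n:5d}")
print(f"{'total':28s} {total:5d}")
```

Under these assumptions the attention and MLP matrices account for roughly three quarters of the weights, which is why a single block is still small enough to inspect by hand.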


We work through four concrete investigations:
* We extract the token embedding matrix and apply PCA to ask whether vowels and consonants form geometrically distinct clusters — testing the hypothesis that the model has learned something about phonetic structure purely from next-character prediction.
* We extract attention weight matrices for specific inputs and visualise them as heatmaps, then run a simple hypothesis test (via scipy.stats) asking whether attention weights on earlier occurrences of the current character are higher than those on other positions.
* We perform systematic head ablation — zeroing individual attention heads and measuring loss change — to identify which heads are load-bearing and which are redundant.
* We use the gradients already computed by microgpt's autograd engine to perform logit attribution: tracing which embedding dimensions most strongly influenced a given prediction.
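
To give a flavour of the third investigation, here is a minimal pure-Python sketch of zero-ablation on a toy one-layer, attention-only model. It is not microgpt and not the tutorial code: the weights are random rather than trained, the helper names (matrix, linear, mean_loss) and all sizes are hypothetical, and the token ids assume a=0..z=25 with the special token as 26. It only shows the mechanics: zero one head's output, recompute the loss, and compare against the baseline.

```python
import math
import random

random.seed(0)  # fixed seed so the toy numbers are reproducible

# Toy sizes chosen for illustration; they are assumptions, not read from microgpt.
VOCAB, BLOCK, N_EMBD, N_HEAD = 27, 16, 16, 4
HEAD_DIM = N_EMBD // N_HEAD


def matrix(nout, nin, std=0.3):
    """Random (nout x nin) weight matrix as nested lists."""
    return [[random.gauss(0, std) for _ in range(nin)] for _ in range(nout)]


def linear(x, w):
    """Matrix-vector product w @ x for w of shape (nout, nin)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_loss(tokens, p, ablate_head=None):
    """Mean next-token cross-entropy; optionally zero one head's output."""
    xs = [[a + b for a, b in zip(p["wte"][tok], p["wpe"][pos])]
          for pos, tok in enumerate(tokens[:-1])]
    qs = [linear(x, p["wq"]) for x in xs]
    ks = [linear(x, p["wk"]) for x in xs]
    vs = [linear(x, p["wv"]) for x in xs]
    total = 0.0
    for t, x in enumerate(xs):
        heads_out = []
        for h in range(N_HEAD):
            dims = range(h * HEAD_DIM, (h + 1) * HEAD_DIM)
            if h == ablate_head:
                heads_out.extend(0.0 for _ in dims)  # zero-ablation of this head
                continue
            # Causal attention: position t only sees positions 0..t.
            scores = [sum(qs[t][j] * ks[s][j] for j in dims) / math.sqrt(HEAD_DIM)
                      for s in range(t + 1)]
            weights = softmax(scores)
            heads_out.extend(sum(weights[s] * vs[s][j] for s in range(t + 1))
                             for j in dims)
        # Residual connection, then project to vocabulary logits.
        resid = [a + b for a, b in zip(x, linear(heads_out, p["wo"]))]
        probs = softmax(linear(resid, p["lm_head"]))
        total += -math.log(probs[tokens[t + 1]])
    return total / len(xs)


params = {
    "wte": matrix(VOCAB, N_EMBD),
    "wpe": matrix(BLOCK, N_EMBD),
    "wq": matrix(N_EMBD, N_EMBD),
    "wk": matrix(N_EMBD, N_EMBD),
    "wv": matrix(N_EMBD, N_EMBD),
    "wo": matrix(N_EMBD, N_EMBD),
    "lm_head": matrix(VOCAB, N_EMBD),
}

# "emma" framed by the special token (assumed encoding: a=0..z=25, special=26).
tokens = [26, 4, 12, 12, 0, 26]

baseline = mean_loss(tokens, params)
deltas = [mean_loss(tokens, params, ablate_head=h) - baseline
          for h in range(N_HEAD)]
for h, d in enumerate(deltas):
    print(f"head {h}: loss change {d:+.4f}")
```

With random weights the loss changes carry no meaning; on a trained model, a large positive change would mark a head as load-bearing, and a change near zero would suggest redundancy. Averaging over many names rather than a single input would give a steadier estimate.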

Participants will leave with a modular, reusable codebase, a concrete mental model for mechanistic interpretability, and pointers to how these techniques scale to production models via TransformerLens and Anthropic's circuits research.


Expected audience expertise: Domain: some
Expected audience expertise: Python: some
Supporting material: Supporting material
Your relationship with the presented work/project: Original author or co-author

Previous experience as a data scientist working on varied business propositions, ranging from detecting scientific fraud in publishing and supply chain optimization to customer attrition, upselling/cross-selling of card products, web personalization, and customer-merchant affinity.
