Reading
Papers and posts that inform the research direction. Curated, not exhaustive.
-
Circuits Foundation
A Mathematical Framework for Transformer Circuits
Elhage et al., 2021
Linear-algebra toolbox for reasoning about attention and MLP computation as composable circuits.
-
Representation Geometry
Toy Models of Superposition
Elhage et al., 2022
Why models pack many concepts into shared neuron directions, and when sparse features can still be recovered despite the mixing.
-
Mechanistic Case Study
In-context Learning and Induction Heads
Olsson et al., 2022
Identifies a concrete head family explaining the algorithmic core of in-context next-token prediction.
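The algorithm these heads implement, match the current token against its previous occurrence and copy what followed, fits in a few lines. A minimal sketch with toy string tokens, not the paper's attention-level mechanism:

```python
def induction_predict(tokens):
    """Predict the next token by prefix matching: scan backwards for the
    most recent earlier occurrence of the current token, then copy the
    token that followed that occurrence."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # walk the prefix right-to-left
        if tokens[i] == current:
            return tokens[i + 1]              # copy the successor token
    return None  # no earlier occurrence: induction offers no prediction

# "A B C A" -> the earlier A matched, its successor B is copied forward
prediction = induction_predict(["A", "B", "C", "A"])  # -> "B"
```

An induction head realizes this lookup with attention: one head shifts attention to the token after a match, a second copies its content.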
-
Mechanistic Case Study
Interpretability in the Wild: IOI in GPT-2
Wang et al., 2022
Reverse-engineers a full multi-component language circuit rather than isolated heads or neurons.
-
Feature Discovery
Towards Monosemanticity
Bricken et al., 2023
The practical SAE workflow for extracting cleaner, more human-readable latent features.
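The core of the workflow is a one-layer sparse autoencoder trained on residual-stream activations. A minimal forward-pass sketch, with illustrative dimensions and sparsity coefficient rather than the paper's hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 64, 256          # dictionary wider than the activation space
W_enc = rng.normal(0, 0.1, (d_dict, d_model))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_model, d_dict))
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    """Encode an activation vector into sparse latents, reconstruct it,
    and return the training loss (fidelity + L1 sparsity penalty)."""
    f = np.maximum(W_enc @ x + b_enc, 0.0)   # ReLU latents, mostly zero
    x_hat = W_dec @ f + b_dec                # linear reconstruction
    recon = np.sum((x - x_hat) ** 2)         # reconstruction error
    sparsity = l1_coeff * np.sum(np.abs(f))  # L1 pushes latents toward zero
    return x_hat, f, recon + sparsity

x = rng.normal(size=d_model)
x_hat, f, loss = sae_forward(x)
```

Training minimizes the returned loss over a large activation dataset; each latent dimension is then a candidate interpretable feature.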
-
Feature Discovery
Sparse Autoencoders Find Highly Interpretable Features
Cunningham et al., 2023
SAE features can be interpreted and causally steered in a language model.
-
Scaling SAEs
Scaling Monosemanticity
Templeton et al., 2024
Runs the SAE workflow at frontier scale and examines which feature types emerge as dictionary size grows.
-
Evaluation
SAEBench
Karvonen et al., ICML 2025
Benchmark suite for comparing SAE quality across methods with shared evaluation tasks.
-
Evaluation
SynthSAEBench
Chanin and Garriga-Alonso, 2026
Controlled synthetic settings to separate true representation recovery from metric illusions.
-
Circuit Tracing
Circuit Tracing: Revealing Computational Graphs
Anthropic, 2025
Graph-level attribution maps capturing pathways across many layers.
-
Circuit Tracing
On the Biology of a Large Language Model
Anthropic, 2025
Attribution-graph tools reveal recurring computational motifs inside a production-scale model.
-
Model Editing
Locating and Editing Factual Associations in GPT (ROME)
Meng et al., 2022
Edits a single factual association with a rank-one update to one mid-layer MLP weight matrix.
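The edit has a closed rank-one form: choose the minimal-norm update that maps a key vector k* to a desired value v*. The sketch below drops ROME's key-covariance whitening (it takes the covariance to be the identity), so it is a simplified illustration rather than the paper's exact estimator:

```python
import numpy as np

def rank_one_edit(W, k_star, v_star):
    """Minimal-norm rank-one update so the edited weight maps the key
    exactly to the new value: W' @ k_star == v_star.
    (Simplification vs. ROME: identity key covariance.)"""
    residual = v_star - W @ k_star            # what the current weight misses
    return W + np.outer(residual, k_star) / (k_star @ k_star)

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))                   # stand-in for an MLP weight
k = rng.normal(size=8)                        # key encoding the subject
v = rng.normal(size=8)                        # value encoding the new fact
W_edit = rank_one_edit(W, k, v)
```

Because the perturbation is rank one, directions orthogonal to k* are nearly untouched, which is why unrelated behavior is largely preserved.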
-
Model Editing
Mass-Editing Memory in a Transformer (MEMIT)
Meng et al., 2022
Scales targeted factual editing from one memory to many while preserving unrelated behavior.
-
Automated Interpretability
Language Models Can Explain Neurons
Bills et al., OpenAI, 2023
LMs generate and score neuron explanations, partially automating a previously manual workflow.
-
Geometry
The Information Geometry of Softmax
Park et al., 2026
Behavior-preserving interventions should respect information geometry, not only Euclidean distance.
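A one-line illustration of why Euclidean distance on logits misleads: shifting every logit by a constant is a large move in parameter space but leaves the softmax output unchanged, so the induced distribution sees zero divergence. A minimal numerical check:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                 # shift-invariant by construction
    return e / e.sum()

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

z = np.array([2.0, 1.0, 0.5])
z_shift = z + 10.0                          # big move in logit space
p, q = softmax(z), softmax(z_shift)

euclid = float(np.linalg.norm(z_shift - z))  # ~17.3: "large" intervention
div = kl(p, q)                               # 0: identical distributions
```

An intervention metric that respects the model's output geometry (e.g. KL or Fisher distance) treats these two logit vectors as the same point; Euclidean distance does not.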
-
SAE Objective
Interpretability as Compression (MDL-SAEs)
Ayonrinde, Pearce, and Sharkey, 2024
SAE quality reframed as description length, favoring concise explanations over sparsity alone.
-
Attention + Transport
Attention as One-Sided Entropic Optimal Transport
Litman, 2025
Derives standard softmax attention as the solution of a one-sided entropic optimal-transport problem.
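The correspondence in miniature: with only one marginal constraint (rows sum to 1), the entropy-regularized transport problem is solved by a single normalization of exp(S/ε), which at ε = 1 is exactly row-wise softmax. A numerical sketch of that claim, not the paper's derivation:

```python
import numpy as np

def softmax_rows(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def one_sided_entropic_plan(S, eps=1.0):
    """Solve max_P <P, S> + eps * H(P) subject only to row sums = 1.
    With a single marginal constraint no Sinkhorn iterations are needed:
    the optimum is P = row-normalize(exp(S / eps))."""
    K = np.exp(S / eps)
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
S = rng.normal(size=(4, 6))            # toy query-key score matrix
P = one_sided_entropic_plan(S, eps=1.0)
A = softmax_rows(S)                    # standard attention weights
```

Adding the second (column) marginal constraint would require full Sinkhorn iteration, which is where generalizations of this view depart from vanilla attention.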
-
Attention + Transport
You Need Better Attention Priors (GOAT)
Litman and Guo, 2026
Generalizes attention with learnable transport priors to address sink behavior.