Reading
Papers and posts that inform the research direction. Curated, not exhaustive.
-
Circuits Foundation
A Mathematical Framework for Transformer Circuits
Elhage et al., 2021
Linear-algebra toolbox for reasoning about attention and MLP computation as composable circuits.
-
Representation Geometry
Toy Models of Superposition
Elhage et al., 2022
Why models pack many concepts into shared neuron directions, and when sparse features can still be recovered despite the mixing.
-
Mechanistic Case Study
In-context Learning and Induction Heads
Olsson et al., 2022
Identifies a concrete head family explaining the algorithmic core of in-context next-token prediction.
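The algorithm these heads implement, match the current token against its previous occurrence and copy what followed, fits in a few lines. A minimal sketch with toy string tokens, not the paper's attention-level mechanism:

```python
def induction_predict(tokens):
    """Predict the next token by prefix matching: scan backwards for the
    most recent earlier occurrence of the current token, then copy the
    token that followed that occurrence."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # walk the prefix right-to-left
        if tokens[i] == current:
            return tokens[i + 1]              # copy the successor token
    return None  # no earlier occurrence: induction offers no prediction

# "A B C A" -> the earlier A matched, its successor B is copied forward
prediction = induction_predict(["A", "B", "C", "A"])  # -> "B"
```

An induction head realizes this lookup with attention: one head shifts attention to the token after a match, a second copies its content.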
-
Mechanistic Case Study
Interpretability in the Wild: IOI in GPT-2
Wang et al., 2022
Reverse-engineers a full multi-component language circuit rather than isolated heads or neurons.
-
Feature Discovery
Towards Monosemanticity
Bricken et al., 2023
The practical SAE workflow for extracting cleaner, more human-readable latent features.
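The core of the workflow is a one-layer sparse autoencoder trained on residual-stream activations. A minimal forward-pass sketch, with illustrative dimensions and sparsity coefficient rather than the paper's hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 64, 256          # dictionary wider than the activation space
W_enc = rng.normal(0, 0.1, (d_dict, d_model))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_model, d_dict))
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    """Encode an activation vector into sparse latents, reconstruct it,
    and return the training loss (fidelity + L1 sparsity penalty)."""
    f = np.maximum(W_enc @ x + b_enc, 0.0)   # ReLU latents, mostly zero
    x_hat = W_dec @ f + b_dec                # linear reconstruction
    recon = np.sum((x - x_hat) ** 2)         # reconstruction error
    sparsity = l1_coeff * np.sum(np.abs(f))  # L1 pushes latents toward zero
    return x_hat, f, recon + sparsity

x = rng.normal(size=d_model)
x_hat, f, loss = sae_forward(x)
```

Training minimizes the returned loss over a large activation dataset; each latent dimension is then a candidate interpretable feature.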
-
Feature Discovery
Sparse Autoencoders Find Highly Interpretable Features
Cunningham et al., 2023
SAE features can be interpreted and causally steered in a language model.
-
Scaling SAEs
Scaling Monosemanticity
Templeton et al., 2024
Runs the SAE workflow at frontier scale and examines which feature types emerge as dictionary size grows.
-
Evaluation
SAEBench
Karvonen et al., ICML 2025
Benchmark suite for comparing SAE quality across methods with shared evaluation tasks.
-
Evaluation
SynthSAEBench
Chanin and Garriga-Alonso, 2026
Controlled synthetic settings to separate true representation recovery from metric illusions.
-
Circuit Tracing
Circuit Tracing: Revealing Computational Graphs
Anthropic, 2025
Graph-level attribution maps capturing pathways across many layers.
-
Circuit Tracing
On the Biology of a Large Language Model
Anthropic, 2025
Attribution-graph tools reveal recurring computational motifs inside a production-scale model.
-
Model Editing
Locating and Editing Factual Associations in GPT (ROME)
Meng et al., 2022
Edits a single factual association with a rank-one update to one mid-layer MLP weight matrix.
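The edit has a closed rank-one form: choose the minimal-norm update that maps a key vector k* to a desired value v*. The sketch below drops ROME's key-covariance whitening (it takes the covariance to be the identity), so it is a simplified illustration rather than the paper's exact estimator:

```python
import numpy as np

def rank_one_edit(W, k_star, v_star):
    """Minimal-norm rank-one update so the edited weight maps the key
    exactly to the new value: W' @ k_star == v_star.
    (Simplification vs. ROME: identity key covariance.)"""
    residual = v_star - W @ k_star            # what the current weight misses
    return W + np.outer(residual, k_star) / (k_star @ k_star)

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))                   # stand-in for an MLP weight
k = rng.normal(size=8)                        # key encoding the subject
v = rng.normal(size=8)                        # value encoding the new fact
W_edit = rank_one_edit(W, k, v)
```

Because the perturbation is rank one, directions orthogonal to k* are nearly untouched, which is why unrelated behavior is largely preserved.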
-
Model Editing
Mass-Editing Memory in a Transformer (MEMIT)
Meng et al., 2022
Scales targeted factual editing from one memory to many while preserving unrelated behavior.
-
Automated Interpretability
Language Models Can Explain Neurons
Bills et al., OpenAI, 2023
LMs generate and score neuron explanations, partially automating a previously manual workflow.
-
Geometry
The Information Geometry of Softmax
Park et al., 2026
Behavior-preserving interventions should respect information geometry, not only Euclidean distance.
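A one-line illustration of why Euclidean distance on logits misleads: shifting every logit by a constant is a large move in parameter space but leaves the softmax output unchanged, so the induced distribution sees zero divergence. A minimal numerical check:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                 # shift-invariant by construction
    return e / e.sum()

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

z = np.array([2.0, 1.0, 0.5])
z_shift = z + 10.0                          # big move in logit space
p, q = softmax(z), softmax(z_shift)

euclid = float(np.linalg.norm(z_shift - z))  # ~17.3: "large" intervention
div = kl(p, q)                               # 0: identical distributions
```

An intervention metric that respects the model's output geometry (e.g. KL or Fisher distance) treats these two logit vectors as the same point; Euclidean distance does not.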
-
SAE Objective
Interpretability as Compression (MDL-SAEs)
Ayonrinde, Pearce, and Sharkey, 2024
SAE quality reframed as description length, favoring concise explanations over sparsity alone.
-
Attention + Transport
Attention as One-Sided Entropic Optimal Transport
Litman, 2025
Derives standard softmax attention as the solution of a one-sided entropic optimal-transport problem.
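The correspondence in miniature: with only one marginal constraint (rows sum to 1), the entropy-regularized transport problem is solved by a single normalization of exp(S/ε), which at ε = 1 is exactly row-wise softmax. A numerical sketch of that claim, not the paper's derivation:

```python
import numpy as np

def softmax_rows(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def one_sided_entropic_plan(S, eps=1.0):
    """Solve max_P <P, S> + eps * H(P) subject only to row sums = 1.
    With a single marginal constraint no Sinkhorn iterations are needed:
    the optimum is P = row-normalize(exp(S / eps))."""
    K = np.exp(S / eps)
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
S = rng.normal(size=(4, 6))            # toy query-key score matrix
P = one_sided_entropic_plan(S, eps=1.0)
A = softmax_rows(S)                    # standard attention weights
```

Adding the second (column) marginal constraint would require full Sinkhorn iteration, which is where generalizations of this view depart from vanilla attention.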
-
Attention + Transport
You Need Better Attention Priors (GOAT)
Litman and Guo, 2026
Generalizes attention with learnable transport priors to address sink behavior.