A Mathematical Framework for Transformer Circuits (Elhage et al., 2021)

Foundational mechanistic interpretability paper on transformer circuit analysis

Neighborhood — ranked by edge-count

paper

venue

Transformer Circuits Thread
cites
Anthropic's mechanistic interpretability research blog where this paper was published.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

A Mathematical Framework for Transformer Circuits · framework · 0.939
Prior Anthropic paper enabling circuit-level analysis of attention-only transformers; motivates current MLP decomposition
Circuits Framework · framework · 0.787
Mechanistic interpretability framework for understanding neural network computation as circuits of features
Learning to encode position for transformer with continuous dynamical model (Liu et al., 2020) · concept · 0.769
Prior work on learned dynamic position encodings; cited alongside Wang et al. as precedent.
Transformer can be viewed as a Wolfram causal graph with foliations specifying computation order. · claim · 0.760
Janus's interpretive framing of transformers as causal graphs.
transformer architecture · framework · 0.757
Neural network architecture based on attention, commonly used in large language models
Transformers use an anti-Markovian solution that recomputes relevant numeric information at each step in the Multi-Object task · claim · 0.750
Prior finding from Grant et al. 2025 used to interpret low MAS IIA for GRU-Transformer hidden state comparisons.
The transformer entity is tricameral (base simulator, simulated simulator, simulated awareness), but there is less discreteness between these layers than previously claimed. · claim · 0.747
Antra's revision of her earlier model; still considers interference between levels important.
Transformer decoder architecture · framework · 0.742
Base architecture of reasoning LLMs studied, with attention and MLP blocks per layer
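
The selection rule behind the list above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the system's actual implementation: the function names `cosine` and `untyped_neighbors`, the toy two-dimensional embeddings, and the edge representation as name pairs are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, similarity) pairs for entities that are at least
    `threshold` similar to `target` but share no typed edge with it,
    sorted by descending similarity."""
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        sim = cosine(embeddings[target], vec)
        if sim >= threshold:
            results.append((name, sim))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: four entities with 2-d embeddings.
embeddings = {
    "paper": [1.0, 0.0],
    "venue": [1.0, 0.1],  # very similar, but already linked by a typed edge
    "near": [0.8, 0.6],   # cosine with "paper" is 0.8
    "far": [0.0, 1.0],    # cosine with "paper" is 0.0
}
typed_edges = [("paper", "venue")]

candidates = untyped_neighbors("paper", embeddings, typed_edges)
```

In this toy example only `near` is returned: `venue` is excluded because it already has a typed edge to the target, and `far` falls below the threshold. A very high score with no typed edge, such as the 0.939 entry at the top of the list, is the kind of case the page flags as a possible unrecognized duplicate.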