Transformer Circuits Thread

Anthropic's mechanistic interpretability research blog where this paper was published.

Neighborhood — ranked by edge-count

paper

concept

A Mathematical Framework for Transformer Circuits (Elhage et al., 2021)
cites
Foundational mechanistic interpretability paper on transformer circuit analysis
Emergent Introspective Awareness in Large Language Models (Lindsey, 2025)
cites
Related work demonstrating LLM introspective capabilities, with a scale-dependent pattern that parallels ESR
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024)
cites
Key paper on scaling SAE-based interpretability to frontier models, cited as precedent
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023)
cites
Foundational SAE mechanistic interpretability paper

artifact

Large Language Models Report Subjective Experience Under Self-Referential Processing
cites
Key paper finding that LLMs produce structured first-person descriptions claiming awareness or subjective experience during self-referential processing.

event

Sofroniew et al. 2026: Emotion Concepts and Function in a Large Language Model
cites
Transformer Circuits paper identifying emotion-concept representations that influence safety behaviors; key related work, published April 2026