Discovering variable binding circuitry with desiderata

By Xander Davies · Max Nadeau · Nikhil Prakash · Tamar Rott Shaham · David Bau

DOI 10.48550/arxiv.2307.03637 arXiv 2307.03637

Original abstract

Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach which extends causal mediation experiments to automatically identify model components responsible for performing a specific subtask by solely specifying a set of desiderata, or causal attributes of the model components executing that subtask. As a proof of concept, we apply our method to automatically discover shared variable binding circuitry in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. Our method successfully localizes variable binding to only 9 attention heads (of the 1.6k) and one MLP in the final token's residual stream.
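
As a rough illustration of the idea in the abstract, the sketch below selects components of a toy model by checking causal desiderata with patching. It is not the authors' implementation and does not involve LLaMA-13B: the three-component "model", the names `COMPONENTS`, `run`, `patched_output`, and `satisfies_desiderata`, and the two desiderata are all assumptions made up for this example.

```python
from typing import Callable, Dict, Optional

Inputs = Dict[str, int]

# Hypothetical toy "model": each component writes one number and the
# model output is the sum of all component outputs.
COMPONENTS: Dict[str, Callable[[Inputs], int]] = {
    "bind_value": lambda inp: inp["var_value"],  # retrieves the variable's value
    "read_operand": lambda inp: inp["operand"],  # reads the literal operand
    "bias": lambda inp: 0,                       # ignores the input entirely
}


def run(inp: Inputs, patch: Optional[Dict[str, int]] = None) -> int:
    """Run the toy model, overriding any component outputs listed in `patch`."""
    patch = patch or {}
    return sum(patch.get(name, fn(inp)) for name, fn in COMPONENTS.items())


def patched_output(base: Inputs, source: Inputs, name: str) -> int:
    """Run on `base`, but take component `name`'s output from a run on `source`."""
    return run(base, patch={name: COMPONENTS[name](source)})


def satisfies_desiderata(name: str) -> bool:
    base = {"var_value": 3, "operand": 2}          # e.g. "x = 3; x + 2"
    diff_value = {"var_value": 7, "operand": 2}    # only the variable value differs
    diff_operand = {"var_value": 3, "operand": 5}  # only the operand differs
    # Assumed desideratum 1: patching from a run with a different variable
    # value should produce that run's answer.
    d1 = patched_output(base, diff_value, name) == run(diff_value)
    # Assumed desideratum 2: patching from a run with a different operand
    # should leave the original answer unchanged.
    d2 = patched_output(base, diff_operand, name) == run(base)
    return d1 and d2


selected = [name for name in COMPONENTS if satisfies_desiderata(name)]
print(selected)
```

In this toy setting only `bind_value` passes both checks. A real model would replace the exhaustive per-component loop with interventions on attention heads and MLPs, which this sketch does not attempt to reproduce.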

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

Certified Circuits: Stability Guarantees for Mechanistic Circuits
Alaa Anani, Tobias Lorenz, Bernt Schiele, Mario Fritz, Jonas Fischer
2026
≈ 66%
Automatically Identifying Local and Global Circuits with Linear Computation Graphs
Xuyang Ge, Fukang Zhu, Wentao Shu, Junxuan Wang, Zhengfu He, Xipeng Qiu
2024
≈ 66%
Seeing Through Circuits: Faithful Mechanistic Interpretability for Vision Transformers
Nina Żukowska, Wolfgang Stammer, Bernt Schiele, Jonas Fischer
2026
≈ 65%
Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path
Andres Saurez, Neha Sengar, Dongsoo Har
2026
≈ 65%
Strong regulatory graphs
Patric Gustafsson and Ion Petre
2026
≈ 65%
BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection
Yaniv Nikankin, Dana Arad, Itay Itzhak, Anja Reusch, Adi Simhi, Gal Kesten-Pomeranz, Yonatan Belinkov
2025
≈ 65%
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou
2025
≈ 65%
Discovering and Reasoning of Causality in the Hidden World with Large Language Models
Chenxi Liu, Yongqiang Chen, Tongliang Liu, Mingming Gong, James Cheng, Bo Han, Kun Zhang
2025
≈ 64%
Average Attention Transformers and Arithmetic Circuits
Lena Ehrmuth and Laura Strieker
2026
≈ 64%
Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models
Dana Arad, Yonatan Belinkov, Hanjie Chen, Najoung Kim, Hosein Mohebbi, Aaron Mueller, Gabriele Sarti, Martin Tutek
2025
≈ 64%
How causal analysis can reveal autonomy in models of biological systems
William Marshall, Hyunju Kim, Sara I. Walker, Giulio Tononi and Larissa Albantakis
2018
≈ 64%
Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees
Itamar Hadad, Guy Katz, Shahaf Bassan
2026
≈ 63%
PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits
Maximilian Dreyer, Erblina Purelku, Johanna Vielhaben, Wojciech Samek, Sebastian Lapuschkin
2024
≈ 63%
Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition
Aliyah R. Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri, Yaxuan Huang, Anobel Y. Odisho, Peter R. Carroll, Bin Yu
2025
≈ 63%
Zoom In: An Introduction to Circuits
in corpus
2020
≈ 63%
Circuit Stability Characterizes Language Model Generalization
Alan Sun
2025
≈ 63%
Differentiable Logic Cellular Automata: From Game of Life to pattern generation with learned recurrent circuits
in corpus
≈ 63%
Learning without neurons in physical systems
in corpus
2022
≈ 63%
Multiple ways to implement and infer sentience
in corpus
≈ 61%
Active inference on discrete state-spaces: a synthesis
in corpus
2020
≈ 61%
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
in corpus
2023
≈ 61%
Model Alignment Search
in corpus
2025
≈ 61%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 60%
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
in corpus
2026
≈ 60%
Collective intelligence: A unifying concept for integrating biology across scales and substrates
in corpus
2024
≈ 59%
Opening the Hood of a Word Processor
in corpus
1984
≈ 59%
Active Inference: A Process Theory
in corpus
2017
≈ 59%
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
in corpus
2024
≈ 59%

Cited by (2)

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as