paper:arxiv-2307-03637
Discovering variable binding circuitry with desiderata
Original abstract
Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach which extends causal mediation experiments to automatically identify model components responsible for performing a specific subtask by solely specifying a set of desiderata, or causal attributes of the model components executing that subtask. As a proof of concept, we apply our method to automatically discover shared variable binding circuitry in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. Our method successfully localizes variable binding to only 9 attention heads (of the 1.6k) and one MLP in the final token's residual stream.
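As a rough illustration of the idea in the abstract (patching component activations between runs and keeping the components whose causal effects match a stated set of desiderata), here is a minimal sketch on a toy additive model. It is not the paper's procedure: the components `head_a`, `head_b`, `mlp_c`, the inputs, and the two desiderata are all hypothetical and chosen only to make the selection logic concrete.

```python
from typing import Callable, Dict, Optional

Input = Dict[str, float]

# Hypothetical toy "model": each component maps an input to an activation,
# and the model output is the sum of all component activations.
COMPONENTS: Dict[str, Callable[[Input], float]] = {
    "head_a": lambda inp: inp["x"],        # carries the variable value
    "head_b": lambda inp: 0.5 * inp["d"],  # carries a distractor
    "mlp_c": lambda inp: 1.0,              # input-independent bias
}


def run(base: Input, source: Optional[Input] = None,
        patched: Optional[str] = None) -> float:
    """Run on `base`; if `patched` is given, take that one component's
    activation from a run on `source` instead (activation patching)."""
    total = 0.0
    for name, fn in COMPONENTS.items():
        use_source = source is not None and name == patched
        total += fn(source if use_source else base)
    return total


def effect(base: Input, source: Input, name: str) -> float:
    """Change in output caused by patching component `name` from `source`."""
    return run(base, source, name) - run(base)


def satisfies_desiderata(name: str, tol: float = 1e-9) -> bool:
    """Assumed desiderata: the component's patch effect must respond to the
    variable value and must not respond to the distractor."""
    base = {"x": 3.0, "d": 2.0}
    value_changed = {"x": 7.0, "d": 2.0}
    distractor_changed = {"x": 3.0, "d": 9.0}
    responds_to_value = abs(effect(base, value_changed, name)) > tol
    ignores_distractor = abs(effect(base, distractor_changed, name)) <= tol
    return responds_to_value and ignores_distractor


selected = [name for name in COMPONENTS if satisfies_desiderata(name)]
print(selected)
```

In this toy setting only `head_a` passes both checks. In a real transformer the same pattern would operate on attention-head and MLP outputs, which is considerably more involved than this sketch.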
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- Certified Circuits: Stability Guarantees for Mechanistic Circuits. Tobias Lorenz, Bernt Schiele, Mario Fritz, Jonas Fischer, Alaa Anani. 2026. ≈ 66%
- Automatically Identifying Local and Global Circuits with Linear Computation Graphs. Fukang Zhu, Wentao Shu, Junxuan Wang, Zhengfu He, Xipeng Qiu, Xuyang Ge. 2024. ≈ 66%
- Seeing Through Circuits: Faithful Mechanistic Interpretability for Vision Transformers. Wolfgang Stammer, Bernt Schiele, Jonas Fischer, Nina Żukowska. 2026. ≈ 65%
- Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path. Neha Sengar, Dongsoo Har, Andres Saurez. 2026. ≈ 65%
- BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection. Dana Arad, Itay Itzhak, Anja Reusch, Adi Simhi, Gal Kesten-Pomeranz, Yonatan Belinkov, Yaniv Nikankin. 2025. ≈ 65%
- Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis. Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou, Xu Wang. 2025. ≈ 65%
- Discovering and Reasoning of Causality in the Hidden World with Large Language Models. Yongqiang Chen, Tongliang Liu, Mingming Gong, James Cheng, Bo Han, Kun Zhang, Chenxi Liu. 2025. ≈ 64%
- Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models. Yonatan Belinkov, Hanjie Chen, Najoung Kim, Hosein Mohebbi, Aaron Mueller, Gabriele Sarti, Martin Tutek, Dana Arad. 2025. ≈ 64%
- How causal analysis can reveal autonomy in models of biological systems. Hyunju Kim, Sara I. Walker, Giulio Tononi and Larissa Albantakis, William Marshall. 2018. ≈ 64%
- Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees. Guy Katz, Shahaf Bassan, Itamar Hadad. 2026. ≈ 63%
- PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits. Erblina Purelku, Johanna Vielhaben, Wojciech Samek, Sebastian Lapuschkin, Maximilian Dreyer. 2024. ≈ 63%
- Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition. Georgia Zhou, Yeshwanth Cherapanamjeri, Yaxuan Huang, Anobel Y. Odisho, Peter R. Carroll, Bin Yu, Aliyah R. Hsu. 2025. ≈ 63%
- Zoom In: An Introduction to Circuits [in corpus]. 2020. ≈ 63%
- Learning without neurons in physical systems [in corpus]. 2022. ≈ 63%
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations [in corpus]. 2023. ≈ 61%
- Model Alignment Search [in corpus]. 2025. ≈ 61%
- Collective intelligence: A unifying concept for integrating biology across scales and substrates [in corpus]. 2024. ≈ 59%
- Opening the Hood of a Word Processor [in corpus]. 1984. ≈ 59%
- Active Inference: A Process Theory [in corpus]. 2017. ≈ 59%
Cited by (2)
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as