Causal abstraction: A theoretical foundation for mechanistic interpretability

By Atticus Geiger · Duligur Ibeling · Amir Zur · Maheep Chaudhary · Sonakshi Chauhan · Jing Huang +5 more

arXiv 2301.04709

Related work: refs + corpus + external arXiv

Cited, in-corpus, and arXiv badges show which signals surfaced each row. Rows surfaced by multiple sources are weighted higher.

Causal Abstractions, Categorically Unified
Devendra Singh Dhami, Markus Englberger
2025
≈ 81%
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i
Louis Jaburi, Kola Ayonrinde
2025
≈ 81%
Combining Causal Models for More Accurate Abstractions of Neural Networks
Sara Magliacane, Atticus Geiger, Theodora-Mara Pîslar
2025
≈ 80%
Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims
Fengming Liu, Zezheng Lin
2026
≈ 80%
Mechanistic?
Naomi Saphra and Sarah Wiegreffe
2024
≈ 79%
Propositional Interpretability in Artificial Intelligence
David J. Chalmers
2025
≈ 79%
Mechanistic Interpretability Needs Philosophy
Ninell Oldenburg, Ruchira Dhar, Joshua Hatherley, Constanza Fierro, Nina Rajcic, Sandrine R. Schiller, Filippos Stamatiou, Anders Søgaard, Iwan Williams
2025
≈ 78%
From Mechanistic to Compositional Interpretability
Thomas Dooms, Steven T. Holmer, Kola Ayonrinde, Geraint A. Wiggins, Ward Gauderis
2026
≈ 78%
Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations
Ajay Pravin Mahale
2026
≈ 77%
Validating Mechanistic Interpretations: An Axiomatic Approach
Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina S. Pasareanu, Somesh Jha, Nils Palumbo
2025
≈ 77%
Interpretability as Alignment: Making Internal Understanding a Design Principle
Pratinav Seth, Vinay Kumar Sankarapu, Aadit Sengupta
2025
≈ 77%
Inference of Abstraction for a Unified Account of Symbolic Reasoning from Data
Hiroyuki Kido
2026
≈ 77%
A macro agent and its actions
Francesco Massari, Maggie Beheler-Amass, Giulio Tononi, Larissa Albantakis
2020
≈ 77%
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Louis Jaburi, Kola Ayonrinde
2025
≈ 76%
An Encoding of Abstract Dialectical Frameworks into Higher-Order Logic
Alexander Steen, Antoine Martina
2026
≈ 76%
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
in corpus
2025
≈ 72%
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
in corpus
2026
≈ 71%
Multiple ways to implement and infer sentience
in corpus
≈ 71%
Emergence and Causality in Complex Systems: A Survey on Causal Emergence and Related Quantitative Studies
in corpus
2023
≈ 70%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 70%
Finger Exercises in Formal Concept Analysis
in corpus
2006
≈ 70%
Cognitive glues are shared models of relative scarcities: the economics of collective intelligence
in corpus
2026
≈ 69%
The Causally Emergent Alignment Hypothesis: Causal Emergence Aligns with and Predicts Final Reward in Reinforcement Learning Agents
in corpus
2026
≈ 69%
Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds
in corpus
2022
≈ 69%
2022-09-23_Prabros._dynamics-in-action-pdf1.pdf_2f6a2b
in corpus
≈ 68%
Denotational Design: from meanings to programs
in corpus
2015
≈ 68%
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
in corpus
2024
≈ 68%
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
in corpus
2023
≈ 68%

Similar preprints — Semantic Scholar

Cited by (6)

Addressing divergent representations from causal interventions on neural networks
Causal intervention methods central to mechanistic interpretability—including activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search (DAS)—systemat
Model Alignment Search
Model Alignment Search (MAS) establishes bidirectional causal similarity between neural networks by learning a per-model orthogonal rotation matrix that isolates behaviorally relevant subspaces and us
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be perfectly mapped to any algorithm, a result proven formally in Theorem 1 under five mild assump
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded to 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-mean
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as