Causal abstractions of neural networks

By Atticus Geiger · Hanson Lu · Thomas Icard · Christopher Potts

arXiv 2106.02997

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.

Causal Abstractions, Categorically Unified
Devendra Singh Dhami, Markus Englberger
2025
≈ 78%
Combining Causal Models for More Accurate Abstractions of Neural Networks
Sara Magliacane, Atticus Geiger, Theodora-Mara Pîslar
2025
≈ 76%
Conceptual Views of Neural Networks: A Framework for Neuro-Symbolic Analysis
Johannes Hirth and Tom Hanika
2026
≈ 74%
CausalARC: Abstract Reasoning with Causal World Models
John Kalantari, Kia Khezeli, Jacqueline Maasch
2026
≈ 73%
The Function Representation of Artificial Neural Network
Zhongkui Ma
2026
≈ 72%
Inference of Abstraction for a Unified Account of Symbolic Reasoning from Data
Hiroyuki Kido
2026
≈ 72%
Causal Bayesian Networks for Data-driven Safety Analysis of Complex Systems
Lina Putze, Tjark Koopmann, Jan Reich, Christian Neurohr, Roman Gansch
2025
≈ 72%
On the Mechanistic Interpretability of Neural Networks for Causality in Bio-statistics
Jean-Baptiste A. Conan
2025
≈ 72%
Generative artificial intelligence-enabled dynamic detection of nicotine-related circuits
Changhong Jing, Ye Li, Xinan Liu, Zuxin Chen, Shuqiang Wang, Changwei Gong
2022
≈ 71%
PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction
Arya Datla, Ziv Goldfeld, Jonathn Chang
2026
≈ 71%
Exploratory Causal Inference in SAEnce
Riccardo Cadei, Francesco Locatello, Tommaso Mencattini
2026
≈ 71%
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
in corpus
2023
≈ 71%
Causal Learner: A Toolbox for Causal Structure and Markov Blanket Learning
Kui Yu, Yiwen Zhang, Lin Liu, Jiuyong Li, Zhaolong Ling
2025
≈ 70%
Learning by Abstraction: The Neural State Machine
Drew A. Hudson and Christopher D. Manning
2019
≈ 70%
Abstracting Deep Neural Networks into Concept Graphs for Concept Level Interpretability
Parth Natekar, Ganapathy Krishnamurthi, Balaji Srinivasan, Avinash Kori
2022
≈ 70%
Identifying Sub-networks in Neural Networks via Functionally Similar Representations
Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, Dennis Wei, Tian Gao
2025
≈ 70%
The Causally Emergent Alignment Hypothesis: Causal Emergence Aligns with and Predicts Final Reward in Reinforcement Learning Agents
in corpus
2026
≈ 69%
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
in corpus
2025
≈ 69%
Emergence and Causality in Complex Systems: A Survey on Causal Emergence and Related Quantitative Studies
in corpus
2023
≈ 69%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 68%
Cognitive glues are shared models of relative scarcities: the economics of collective intelligence
in corpus
2026
≈ 68%
The World Inside Neural Networks
in corpus
2026
≈ 66%
Brains and where else? Mapping theories of consciousness to unconventional embodiments
in corpus
2026
≈ 66%
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
in corpus
2026
≈ 65%
Multiple ways to implement and infer sentience
in corpus
≈ 65%
Generalizing frameworks for sentience beyond natural species
in corpus
≈ 65%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 65%
AI: a Bridge toward Diverse Intelligence and Humanity’s Future
in corpus
2024
≈ 65%

Similar preprints — Semantic Scholar

Cited by (9)

Addressing divergent representations from causal interventions on neural networks
Causal intervention methods central to mechanistic interpretability—including activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search (DAS)—systemat
Model Alignment Search
Model Alignment Search (MAS) establishes bidirectional causal similarity between neural networks by learning a per-model orthogonal rotation matrix that isolates behaviorally relevant subspaces and us
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be perfectly mapped to any algorithm, a result proven formally in Theorem 1 under five mild assump
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
pyvene is an open-source Python library that unifies intervention-based research on PyTorch neural models by treating the intervention itself—rather than model surgery code—as the primitive abstractio
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded to 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-mean
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations — a claim supported by PCA visualizations, cross-dataset probe transfer, and cau
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Distributed alignment search (DAS) resolves two blocking limitations of prior causal abstraction work—brute-force alignment search and the localist assumption that high-level variables map to disjoint
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as