A geometric notion of causal probing

By Clément Guerner · Anej Svete · Tianyu Liu · Alexander Warstadt · Ryan Cotterell

DOI: 10.48550/arxiv.2307.15054 · arXiv: 2307.15054

Original abstract

The linear subspace hypothesis (Bolukbasi et al., 2016) states that, in a language model's representation space, all information about a concept such as verbal number is encoded in a linear subspace. Prior work has relied on auxiliary classification tasks to identify and evaluate candidate subspaces that might give support for this hypothesis. We instead give a set of intrinsic criteria which characterize an ideal linear concept subspace and enable us to identify the subspace using only the language model distribution. Our information-theoretic framework accounts for spuriously correlated features in the representation space (Kumar et al., 2022) by reconciling the statistical notion of concept information and the geometric notion of how concepts are encoded in the representation space. As a byproduct of this analysis, we hypothesize a causal process for how a language model might leverage concepts during generation. Empirically, we find that linear concept erasure is successful in erasing most concept information under our framework for verbal number as well as some complex aspect-level sentiment concepts from a restaurant review dataset. Our causal intervention for controlled generation shows that, for at least one concept across two language models, the concept subspace can be used to manipulate the concept value of the generated word with precision.
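
The geometric step the abstract refers to (erasing a concept by removing a representation's component inside a linear subspace, and intervening by swapping that component) can be illustrated with a small sketch. This is not the paper's procedure for finding the subspace, which relies on the language model distribution and information-theoretic criteria; the sketch assumes spanning directions for the concept subspace are already given. The helper names (`orthonormal_basis`, `project`, `erase`, `intervene`) and the toy vectors are hypothetical, chosen only for illustration.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis

def project(h, basis):
    """Component of h inside the subspace spanned by an orthonormal basis."""
    out = [0.0] * len(h)
    for q in basis:
        c = dot(h, q)
        out = [oi + c * qi for oi, qi in zip(out, q)]
    return out

def erase(h, basis):
    """Linear erasure: remove h's component inside the subspace."""
    return [hi - pi for hi, pi in zip(h, project(h, basis))]

def intervene(h, basis, source):
    """Keep h's out-of-subspace part, swap in source's in-subspace part."""
    return [a + b for a, b in zip(erase(h, basis), project(source, basis))]

# Toy 4-dimensional example; the concept direction is an assumption.
concept_directions = [[1.0, 1.0, 0.0, 0.0]]
Q = orthonormal_basis(concept_directions)

h = [3.0, 1.0, 2.0, -1.0]            # representation to edit
source = [-2.0, -2.0, 5.0, 5.0]      # representation carrying the target value

h_erased = erase(h, Q)               # approximately [1, -1, 2, -1]
h_swapped = intervene(h, Q, source)  # approximately [-1, -3, 2, -1]
```

After erasure, `h_erased` has no component along the assumed concept direction, while its remaining coordinates are unchanged. Whether the erased part actually carries all of the concept information is exactly the question the paper's criteria are meant to address; the projection alone does not establish it.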

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Rows surfaced by multiple sources are weighted higher.

Causal Probing for Internal Visual Representations in Multimodal Large Language Models
Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang Zehao Deng
2026
≈ 70%
Quantifying Harm
Hana Chockler, Joseph Y. Halpern Sander Beckers
2026
≈ 69%
CausalSpatial: A Benchmark for Object-Centric Causal Spatial Reasoning
Chenlong Wang, Ruisheng Yuan, Hao Chen, Nanru Dai, S. Kevin Zhou, Yijun Yang, Alan Yuille, Jieneng Chen Wenxin Ma
2026
≈ 68%
Hume's Representational Conditions for Causal Judgment: What Bayesian Formalization Abstracted Away
Yiling Wu
2026
≈ 68%
A physical approach to qualia and the emergence of conscious observers in qualia space
Pedro Resende
2025
≈ 68%
Inference Time Causal Probing in LLMs
Saber Salehkaleybar, Negar Kiyavash, Matthias Grossglauser Sadegh Khorasani
2026
≈ 68%
From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs
Liner Yang, Mengyan Wang, Luming Lu, Weihua An, Erhong Yang Jiyuan An
2026
≈ 67%
Exploratory Causal Inference in SAEnce
Riccardo Cadei, Francesco Locatello Tommaso Mencattini
2026
≈ 67%
Causal Interventions on Causal Paths: Mapping GPT-2's Reasoning From Syntax to Semantics
Joshua Lum, Ziyi Liu, Dani Yogatama Isabelle Lee
2024
≈ 67%
Discovering and Reasoning of Causality in the Hidden World with Large Language Models
Yongqiang Chen, Tongliang Liu, Mingming Gong, James Cheng, Bo Han, Kun Zhang Chenxi Liu
2025
≈ 67%
Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims
Fengming Liu Zezheng Lin
2026
≈ 67%
Causal Discovery Should Embrace the Wisdom of the Crowd
Yuantao Wei, Huiling Liao, Xiaoning Qian, Shuai Huang Ryan Feng Lin
2026
≈ 67%
Causal Abstractions, Categorically Unified
Devendra Singh Dhami Markus Englberger
2025
≈ 66%
Empowerment Gain and Causal Model Construction: Children and adults are sensitive to controllability and variability in their causal interventions
Kelsey Allen, Shiry Ginosar, and Alison Gopnik Eunice Yiu
2026
≈ 66%
Causal Structure Learning: a Bayesian approach based on random graphs
Ivan R. Feliciano-Avelino, L. Enrique Sucar, Hugo J. Escalante Balderas Mauricio Gonzalez-Soto
2026
≈ 66%
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
in corpus
2026
≈ 65%
Steering Along Manifolds to Control Neural Networks
in corpus
≈ 65%
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
in corpus
2026
≈ 65%
Multiple ways to implement and infer sentience
in corpus
≈ 64%
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
in corpus
2026
≈ 63%
From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
in corpus
2025
≈ 63%
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
in corpus
2024
≈ 63%
The World Inside Neural Networks
in corpus
2026
≈ 63%
The Causally Emergent Alignment Hypothesis: Causal Emergence Aligns with and Predicts Final Reward in Reinforcement Learning Agents
in corpus
2026
≈ 63%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 63%
Emergence and Causality in Complex Systems: A Survey on Causal Emergence and Related Quantitative Studies
in corpus
2023
≈ 62%
Testing the Limits of Truth Directions in LLMs
in corpus
2026
≈ 62%
Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds
in corpus
2022
≈ 62%
Cognitive glues are shared models of relative scarcities: the economics of collective intelligence
in corpus
2026
≈ 62%
The Guanyin Protocol: A Framework for Immediately Establishing an Understanding of Both Causality and Compassion in LLM Systems Using Semantic Anchoring
in corpus
2025
≈ 62%

Cited by (3)

CausalGym: Benchmarking causal interpretability methods on linguistic tasks
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded to 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-mean
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as