paper:arxiv-2307-15054
A geometric notion of causal probing
Original abstract
The linear subspace hypothesis (Bolukbasi et al., 2016) states that, in a language model's representation space, all information about a concept such as verbal number is encoded in a linear subspace. Prior work has relied on auxiliary classification tasks to identify and evaluate candidate subspaces that might give support for this hypothesis. We instead give a set of intrinsic criteria which characterize an ideal linear concept subspace and enable us to identify the subspace using only the language model distribution. Our information-theoretic framework accounts for spuriously correlated features in the representation space (Kumar et al., 2022) by reconciling the statistical notion of concept information and the geometric notion of how concepts are encoded in the representation space. As a byproduct of this analysis, we hypothesize a causal process for how a language model might leverage concepts during generation. Empirically, we find that linear concept erasure is successful in erasing most concept information under our framework for verbal number as well as some complex aspect-level sentiment concepts from a restaurant review dataset. Our causal intervention for controlled generation shows that, for at least one concept across two language models, the concept subspace can be used to manipulate the concept value of the generated word with precision.
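The geometric idea behind linear concept erasure can be illustrated with a minimal sketch. It assumes a concept subspace is already given as the column span of a matrix (here a random, hypothetical basis `B`); it does not reproduce the paper's procedure for identifying that subspace from the language model distribution. Erasure is modeled as orthogonal projection onto the complement of the subspace, and the function name `erasure_projection` is illustrative only.

```python
import numpy as np


def erasure_projection(concept_directions: np.ndarray) -> np.ndarray:
    """Return P = I - Q Q^T, the orthogonal projector onto the complement
    of the column span of `concept_directions` (shape: d x k).

    Assumes the columns are linearly independent, so that the reduced QR
    factor Q is an orthonormal basis of exactly that span.
    """
    q, _ = np.linalg.qr(concept_directions)  # q has shape (d, k)
    d = concept_directions.shape[0]
    return np.eye(d) - q @ q.T


# Hypothetical setup: an 8-dimensional representation space with a
# 2-dimensional concept subspace drawn at random (an assumption, not data).
rng = np.random.default_rng(0)
d, k = 8, 2
B = rng.normal(size=(d, k))  # stand-in basis for the concept subspace
h = rng.normal(size=d)       # stand-in for a model representation

P = erasure_projection(B)
h_erased = P @ h             # component orthogonal to the concept subspace
h_concept = h - h_erased     # component inside the concept subspace
```

After projection, `h_erased` has no component along any direction in the assumed subspace, while `h_concept + h_erased` recovers `h`. Whether such a projection removes the concept's information in the statistical sense is the question the paper's framework addresses.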
Related work (refs + corpus + external arXiv)
Cited, in-corpus, and arXiv badges show which signals surfaced each row. Rows surfaced by multiple sources are weighted higher.
- Causal Probing for Internal Visual Representations in Multimodal Large Language Models. Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang, Zehao Deng. 2026. ≈ 70%
- CausalSpatial: A Benchmark for Object-Centric Causal Spatial Reasoning. Chenlong Wang, Ruisheng Yuan, Hao Chen, Nanru Dai, S. Kevin Zhou, Yijun Yang, Alan Yuille, Jieneng Chen, Wenxin Ma. 2026. ≈ 68%
- Hume's Representational Conditions for Causal Judgment: What Bayesian Formalization Abstracted Away. Yiling Wu. 2026. ≈ 68%
- A physical approach to qualia and the emergence of conscious observers in qualia space. Pedro Resende. 2025. ≈ 68%
- Inference Time Causal Probing in LLMs. Saber Salehkaleybar, Negar Kiyavash, Matthias Grossglauser, Sadegh Khorasani. 2026. ≈ 68%
- From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs. Liner Yang, Mengyan Wang, Luming Lu, Weihua An, Erhong Yang, Jiyuan An. 2026. ≈ 67%
- Causal Interventions on Causal Paths: Mapping GPT-2's Reasoning From Syntax to Semantics. Joshua Lum, Ziyi Liu, Dani Yogatama, Isabelle Lee. 2024. ≈ 67%
- Discovering and Reasoning of Causality in the Hidden World with Large Language Models. Yongqiang Chen, Tongliang Liu, Mingming Gong, James Cheng, Bo Han, Kun Zhang, Chenxi Liu. 2025. ≈ 67%
- Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims. Fengming Liu, Zezheng Lin. 2026. ≈ 67%
- Causal Discovery Should Embrace the Wisdom of the Crowd. Yuantao Wei, Huiling Liao, Xiaoning Qian, Shuai Huang, Ryan Feng Lin. 2026. ≈ 67%
- Empowerment Gain and Causal Model Construction: Children and adults are sensitive to controllability and variability in their causal interventions. Kelsey Allen, Shiry Ginosar, Alison Gopnik, Eunice Yiu. 2026. ≈ 66%
- Causal Structure Learning: a Bayesian approach based on random graphs. Ivan R. Feliciano-Avelino, L. Enrique Sucar, Hugo J. Escalante Balderas, Mauricio Gonzalez-Soto. 2026. ≈ 66%
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior. In corpus. 2026. ≈ 63%
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs. In corpus. 2025. ≈ 63%
- The World Inside Neural Networks. In corpus. 2026. ≈ 63%
- Emergence and Causality in Complex Systems: A Survey on Causal Emergence and Related Quantitative Studies. In corpus. 2023. ≈ 62%
- Testing the Limits of Truth Directions in LLMs. In corpus. 2026. ≈ 62%
- Cognitive glues are shared models of relative scarcities: the economics of collective intelligence. In corpus. 2026. ≈ 62%
Cited by (3)
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded to 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-mean
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as