The linear representation hypothesis and the geometry of large language models
Related work: refs + corpus + external arXiv
Cited, in-corpus, and arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.
- Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders. Michael Lan, Philip Torr, Austin Meek, Ashkan Khakzar, David Krueger, Fazl Barez. 2025. ≈ 80%
- Geospatial Mechanistic Interpretability of Large Language Models. Stef De Sabbata, Stefano Mizzaro, Kevin Roitero. 2025. ≈ 80%
- A Survey of Large Language Models. Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen. 2026. ≈ 80%
- Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models. Haeun Yu, Seogyeong Jeong, Siddhesh Pawar, Jisu Shin, Jiho Jin, Junho Myung, Alice Oh, Isabelle Augenstein. 2026. ≈ 80%
- The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence? Zhaoyang Zhang, Run Shao, Dongyue Wu, Jiajie Teng, Chao Tao, Jingdong Chen, Haifeng Li. 2026. ≈ 79%
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. [in corpus] 2023. ≈ 79%
- Across the Levels of Analysis: Explaining Predictive Processing in Humans Requires More Than Machine-Estimated Probabilities. Sathvik Nair, Colin Phillips. 2026. ≈ 79%
- Advancing the Scientific Method with Large Language Models: From Hypothesis to Discovery. Yanbo Zhang, Sumeer A. Khan, Adnan Mahmud, Huck Yang, Alexander Lavin, Michael Levin, Jeremy Frey, Jared Dunnmon, James Evans, Alan Bundy, Saso Dzeroski, Jesper Tegner, Hector Zenil. 2025. ≈ 79%
- Towards Uncovering How Large Language Model Works: An Explainability Perspective. Haiyan Zhao, Fan Yang, Bo Shen, Himabindu Lakkaraju, Mengnan Du. 2024. ≈ 78%
- Mechanistic Indicators of Understanding in Large Language Models. Pierre Beckmann, Matthieu Queloz. 2026. ≈ 78%
- A geometric relation of the error introduced by sampling a language model's output distribution to its internal state. Albert F. Modenbach. 2026. ≈ 78%
- Unsupervised Concept Vector Extraction for Bias Control in LLMs. Hannah Cyberey, Yangfeng Ji, David Evans. 2025. ≈ 78%
- Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective. Bhavik Chandna, Zubair Bashir, Procheta Sen. 2025. ≈ 78%
- From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs. Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An, Erhong Yang. 2026. ≈ 78%
- The Platonic Representation Hypothesis. [in corpus] 2024. ≈ 75%
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? [in corpus] 2025. ≈ 73%
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior. [in corpus] 2026. ≈ 73%
- The World Inside Neural Networks. [in corpus] 2026. ≈ 73%
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders. [in corpus] 2026. ≈ 72%
- Model Alignment Search. [in corpus] 2025. ≈ 71%
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations. [in corpus] 2023. ≈ 71%
Similar preprints — Semantic Scholar
Cited by (5)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of or
- Model Alignment Search
Model Alignment Search (MAS) establishes bidirectional causal similarity between neural networks by learning a per-model orthogonal rotation matrix that isolates behaviorally relevant subspaces and us
- Unveiling the Latent Directions of Reflection in Large Language Models
Reflection in LLMs corresponds to a recoverable latent direction in activation space, not merely a behavioral artifact of prompt engineering. Working with Qwen2.5-3B and Gemma3-4B-IT on the adversaria
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
Probe-based data attribution, introduced here as a method for surfacing and mitigating undesirable post-training behaviors, reduces harmful compliance in OLMo 2 7B by 63% through datapoint filtering a