The linear representation hypothesis and the geometry of large language models

By K. Park · Y. J. Choe · V. Veitch

DOI 10.48550/arXiv.2311.03658 · arXiv 2311.03658

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.

Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
Michael Lan, Philip Torr, Austin Meek, Ashkan Khakzar, David Krueger, Fazl Barez
2025
≈ 80%
Geospatial Mechanistic Interpretability of Large Language Models
Stef De Sabbata, Stefano Mizzaro, Kevin Roitero
2025
≈ 80%
A Survey of Large Language Models
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie and Ji-Rong Wen
2026
≈ 80%
Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models
Haeun Yu, Seogyeong Jeong, Siddhesh Pawar, Jisu Shin, Jiho Jin, Junho Myung, Alice Oh, Isabelle Augenstein
2026
≈ 80%
The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?
Zhaoyang Zhang, Run Shao, Dongyue Wu, Jiajie Teng, Chao Tao, Jingdong Chen, Haifeng Li
2026
≈ 79%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
in corpus
2023
≈ 79%
Across the Levels of Analysis: Explaining Predictive Processing in Humans Requires More Than Machine-Estimated Probabilities
Sathvik Nair and Colin Phillips
2026
≈ 79%
Advancing the Scientific Method with Large Language Models: From Hypothesis to Discovery
Yanbo Zhang, Sumeer A. Khan, Adnan Mahmud, Huck Yang, Alexander Lavin, Michael Levin, Jeremy Frey, Jared Dunnmon, James Evans, Alan Bundy, Saso Dzeroski, Jesper Tegner, Hector Zenil
2025
≈ 79%
Do Multilingual LLMs Think In English?
Lisa Schut, Yarin Gal and Sebastian Farquhar
2025
≈ 79%
LLMorphism: When humans come to see themselves as language models
Valerio Capraro
2026
≈ 78%
Towards Uncovering How Large Language Model Works: An Explainability Perspective
Haiyan Zhao, Fan Yang, Bo Shen, Himabindu Lakkaraju, Mengnan Du
2024
≈ 78%
Mechanistic Indicators of Understanding in Large Language Models
Pierre Beckmann and Matthieu Queloz
2026
≈ 78%
A geometric relation of the error introduced by sampling a language model's output distribution to its internal state
Albert F. Modenbach
2026
≈ 78%
Unsupervised Concept Vector Extraction for Bias Control in LLMs
Hannah Cyberey, Yangfeng Ji, David Evans
2025
≈ 78%
Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
Bhavik Chandna, Zubair Bashir, Procheta Sen
2025
≈ 78%
From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs
Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An, Erhong Yang
2026
≈ 78%
The Platonic Representation Hypothesis
in corpus
2024
≈ 75%
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
in corpus
2025
≈ 73%
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
in corpus
2026
≈ 73%
The World Inside Neural Networks
in corpus
2026
≈ 73%
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
in corpus
2026
≈ 73%
Can "consciousness" be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis
in corpus
2025
≈ 72%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 72%
Steering Along Manifolds to Control Neural Networks
in corpus
≈ 71%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 71%
Model Alignment Search
in corpus
2025
≈ 71%
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
in corpus
2023
≈ 71%

Cited by (5)

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of or
Model Alignment Search
Model Alignment Search (MAS) establishes bidirectional causal similarity between neural networks by learning a per-model orthogonal rotation matrix that isolates behaviorally relevant subspaces and us
Unveiling the Latent Directions of Reflection in Large Language Models
Reflection in LLMs corresponds to a recoverable latent direction in activation space, not merely a behavioral artifact of prompt engineering. Working with Qwen2.5-3B and Gemma3-4B-IT on the adversaria
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
Probe-based data attribution, introduced here as a method for surfacing and mitigating undesirable post-training behaviors, reduces harmful compliance in OLMo 2 7B by 63% through datapoint filtering a