paper:arxiv-2405-14860
Not all language model features are one-dimensionally linear
Original abstract
Recent work has proposed that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, e.g. circular features representing days of the week and months of the year. We identify tasks where these exact circles are used to solve computational problems involving modular arithmetic in days of the week and months of the year. Next, we provide evidence that these circular features are indeed the fundamental unit of computation in these tasks with intervention experiments on Mistral 7B and Llama 3 8B, and we examine the continuity of the days of the week feature in Mistral 7B. Overall, our work argues that understanding multi-dimensional features is necessary to mechanistically decompose some model behaviors.
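As a toy illustration of why a circular (two-dimensional) feature suits modular arithmetic, the sketch below places the days of the week on a unit circle and computes "N days after X" by rotation. It is not the paper's sparse-autoencoder method and uses no model activations. The function names (`embed`, `rotate`, `decode`, `days_after`) and the evenly spaced layout are assumptions made for this example only.

```python
import math

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def embed(index, period):
    """Map a position in a cyclic concept to a point on the unit circle."""
    angle = 2 * math.pi * index / period
    return (math.cos(angle), math.sin(angle))


def rotate(point, steps, period):
    """Rotate a 2-D point by `steps` slots on a circle with `period` slots."""
    angle = 2 * math.pi * steps / period
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


def decode(point, period):
    """Return the slot index whose angle is closest to the point's angle."""
    angle = math.atan2(point[1], point[0]) % (2 * math.pi)
    return round(angle * period / (2 * math.pi)) % period


def days_after(day, offset):
    """Answer 'which day is `offset` days after `day`?' using rotation only."""
    period = len(DAYS)
    start = embed(DAYS.index(day), period)
    return DAYS[decode(rotate(start, offset, period), period)]


print(days_after("Friday", 4))  # Tuesday
```

Rotating by `offset` slots wraps around the circle automatically, so the result matches `(index + offset) % 7` without an explicit modulo on the index. A single one-dimensional direction has no such wrap-around, which is the intuition behind treating the circle as one irreducible feature.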
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.
- A Survey of Large Language Models | Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie and Ji-Rong Wen | 2026 | ≈ 77%
- Across the Levels of Analysis: Explaining Predictive Processing in Humans Requires More Than Machine-Estimated Probabilities | Sathvik Nair and Colin Phillips | 2026 | ≈ 77%
- The Same But Different: Structural Similarities and Differences in Multilingual Language Modeling | Ruochen Zhang, Qinan Yu, Matianyu Zang, Carsten Eickhoff, Ellie Pavlick | 2024 | ≈ 76%
- Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility | Michael A. Lepori, Jennifer Hu, Ishita Dasgupta, Roma Patel, Thomas Serre, Ellie Pavlick | 2026 | ≈ 75%
- Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models | Mingyang Wang, Heike Adel, Lukas Lange, Yihong Liu, Ercong Nie, Jannik Strötgen, Hinrich Schütze | 2025 | ≈ 75%
- Evaluating Neural Language Models as Cognitive Models of Language Acquisition | Héctor Javier Vázquez Martínez, Annika Lea Heuser, Charles Yang, Jordan Kodner | 2026 | ≈ 75%
- Singular Vectors of Attention Heads Align with Features | Gabriel Franco, Carson Loughridge, Mark Crovella | 2026 | ≈ 74%
- Analyze Feature Flow to Enhance Interpretation and Steering in Language Models | Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov | 2025 | ≈ 74%
- What do Language Models Learn and When? The Implicit Curriculum Hypothesis | Emmy Liu, Kaiser Sun, Millicent Li, Isabelle Lee, Lindia Tjuatja, Jen-tse Huang, Graham Neubig | 2026 | ≈ 74%
- Social Meaning in Large Language Models: Structure, Magnitude, and Pragmatic Prompting | Roland Mühlenbernd | 2026 | ≈ 74%
- Controlling Chat Style in Language Models via Single-Direction Editing | Zhenyu Xu and Victor S. Sheng | 2026 | ≈ 73%
- Do LLMs Capture Embodied Cognition and Cultural Variation? Cross-Linguistic Evidence from Demonstratives | Yu Wang, Emmanuele Chersoni and Chu-Ren Huang | 2026 | ≈ 73%
- Can Large Language Models Make Everyone Happy? | Usman Naseem, Gautam Siddharth Kashyap, Ebad Shabbir, Sushant Kumar Ray, Abdullah Mohammad, Rafiq Ali | 2026 | ≈ 73%
- Semantic Convergence: Investigating Shared Representations Across Scaled LLMs | Daniel Son, Sanjana Rathore, Andrew Rufail, Adrian Simon, Daniel Zhang, Soham Dave, Cole Blondin, Kevin Zhu, Sean O'Brien | 2025 | ≈ 73%
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets | in corpus | 2023 | ≈ 70%
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? | in corpus | 2025 | ≈ 68%
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders | in corpus | 2026 | ≈ 67%
- Interpreting Language Model Parameters | in corpus | 2026 | ≈ 66%
- The Platonic Representation Hypothesis | in corpus | 2024 | ≈ 66%
- Testing the Limits of Truth Directions in LLMs | in corpus | 2026 | ≈ 65%
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior | in corpus | 2026 | ≈ 65%
Cited by (4)
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be perfectly mapped to any algorithm, a result proven formally in Theorem 1 under five mild assump
- Unveiling the Latent Directions of Reflection in Large Language Models
Reflection in LLMs corresponds to a recoverable latent direction in activation space, not merely a behavioral artifact of prompt engineering. Working with Qwen2.5-3B and Gemma3-4B-IT on the adversaria
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as