Not all language model features are one-dimensionally linear

By Joshua Engels · Eric J Michaud · Isaac Liao · Wes Gurnee · Max Tegmark

Original abstract

Recent work has proposed that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, e.g. circular features representing days of the week and months of the year. We identify tasks where these exact circles are used to solve computational problems involving modular arithmetic in days of the week and months of the year. Next, we provide evidence that these circular features are indeed the fundamental unit of computation in these tasks with intervention experiments on Mistral 7B and Llama 3 8B, and we examine the continuity of the days of the week feature in Mistral 7B. Overall, our work argues that understanding multi-dimensional features is necessary to mechanistically decompose some model behaviors.
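
To make the idea of a circular feature concrete, here is a minimal toy sketch in Python. It is not the authors' method and does not use any model activations. It assumes, purely for illustration, that each day of the week sits at an evenly spaced angle on a unit circle in a 2D plane, so that "N days after X" becomes a rotation followed by a nearest-point lookup. The names `embed`, `rotate`, `decode`, and `days_after` are hypothetical helpers introduced for this example.

```python
import numpy as np

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def embed(index: int, period: int) -> np.ndarray:
    """Place item `index` of a cycle of length `period` on the unit circle.

    Assumption for illustration: items are evenly spaced by angle.
    """
    angle = 2 * np.pi * index / period
    return np.array([np.cos(angle), np.sin(angle)])


def rotate(point: np.ndarray, steps: int, period: int) -> np.ndarray:
    """Advance a 2D point by `steps` positions using a rotation matrix."""
    theta = 2 * np.pi * steps / period
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return rot @ point


def decode(point: np.ndarray, period: int) -> int:
    """Return the index of the cycle item whose embedding is closest to `point`."""
    candidates = np.stack([embed(i, period) for i in range(period)])
    return int(np.argmin(np.linalg.norm(candidates - point, axis=1)))


def days_after(start: str, steps: int) -> str:
    """Answer 'which day is `steps` days after `start`?' by rotating on the circle."""
    period = len(DAYS)
    point = rotate(embed(DAYS.index(start), period), steps, period)
    return DAYS[decode(point, period)]


if __name__ == "__main__":
    print(days_after("Monday", 2))    # Wednesday
    print(days_after("Saturday", 4))  # Wednesday (wraps past Sunday)
```

In this toy setup the wrap-around from Sunday back to Monday comes from the geometry of the circle, not from an explicit modulo operation. Neither coordinate alone identifies the day, which is the intuition behind calling such a representation irreducibly two-dimensional; the paper's formal definition is stated in terms of independence and co-occurrence and is not reproduced here.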


Cited by (4)

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be perfectly mapped to any algorithm, a result proven formally in Theorem 1 under five mild assump
Unveiling the Latent Directions of Reflection in Large Language Models
Reflection in LLMs corresponds to a recoverable latent direction in activation space, not merely a behavioral artifact of prompt engineering. Working with Qwen2.5-3B and Gemma3-4B-IT on the adversaria
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as