The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

By Samuel Marks · Max Tegmark (MIT, Northeastern University)

DOI 10.48550/arxiv.2310.06824 · arXiv 2310.06824 · OpenAlex W4387561538

Causal Mediation Analysis; Eliciting Latent Knowledge (ELK); Contrast-Consistent Search (CCS); cities_cities_conj dataset; Linear World Models in LLMs; cities_cities_disj dataset; Contrast Pairs; Mass-Mean Probing; cities dataset; Factuality; Superposition Hypothesis; common_claim_true_false dataset; Feature Interference (+19 more)

TL;DR

At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations. The claim is supported by PCA visualizations, cross-dataset probe transfer, and causal intervention experiments on the LLaMA-2 family (7B, 13B, and 70B). PCA of residual stream activations at the most-downstream causally-implicated hidden state (group (b), e.g., layer 15 of LLaMA-2-13B over end-of-sentence punctuation) reveals clear linear separation of true and false statements across structurally and topically diverse datasets including cities (1,496 rows), sp_en_trans (354 rows), larger_than/smaller_than (1,980 rows each), and uncurated datasets from Azaria & Mitchell and Casper et al. The paper introduces mass-mean probing (MM), an optimization-free method that computes the difference-in-means direction between true and false activations and optionally applies a covariance correction. MM outperforms logistic regression (LR) and contrast-consistent search (CCS) on causal intervention metrics in 7 of 8 experimental conditions despite comparable classification accuracy across methods, achieving normalized indirect effects as high as 0.97 (true→false) on sp_en_trans for LLaMA-2-13B with cities+neg_cities training data. Probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans for LLaMA-2-13B and 70B, a cross-topic generalization that fails for LLaMA-2-7B, where representations cluster by surface-level token features instead. The paper argues that truth therefore has a geometrically coherent, causally active linear representation in large transformers, and that interventions targeting this direction can reliably flip a model's expressed truth judgments on out-of-distribution inputs.

What to take away

1. LLaMA-2-70B and 13B, but not 7B, show clear linear separation of true and false statements in the top two PCA dimensions of residual stream activations at group (b) hidden states (e.g., layer 15 over end-of-sentence punctuation in LLaMA-2-13B).
2. Mass-mean probing (MM), which uses the difference-in-means direction with an optional covariance correction, outperforms logistic regression (LR) and contrast-consistent search (CCS) on normalized indirect effect (NIE) in 7 of 8 causal intervention conditions, despite similar classification accuracy across methods.
3. The MM probe trained on cities+neg_cities achieves NIEs of 0.85 (false→true) and 0.97 (true→false) on sp_en_trans for LLaMA-2-13B, compared to LR NIEs of 0.33 and 0.52 respectively.
4. Probes trained on larger_than+smaller_than generalize to >95% accuracy on sp_en_trans for both LLaMA-2-13B and LLaMA-2-70B, demonstrating cross-topic transfer of a linearly-encoded truth direction.
5. Patching experiments on LLaMA-2-13B identify three groups of causally-implicated hidden states: group (a) encoding entity representations, group (b) encoding statement-level truth above end-of-sentence punctuation, and group (c) directly driving the TRUE/FALSE output logits.
6. Probes trained on the likely dataset — nonfactual text where the final token is the most or 100th most probable completion per LLaMA-13B — perform worse than chance on datasets with anti-correlations between truth and text probability (e.g., neg_cities, where r = −0.63), ruling out probability-of-text as the underlying represented feature.
7. For LLaMA-2-13B, cities and neg_cities representations transition from antipodal alignment in early layers, through orthogonal separation at intermediate layers, to shared-axis alignment in later layers, suggesting a hierarchical emergence from surface features (e.g., 'close association') to abstract truth.
8. A replicable methodology: activations are extracted at the most-downstream group (b) hidden state without a few-shot prompt, centered by subtracting the dataset mean, and projected via PCA; probes are trained on 80% of the training dataset and tested in-distribution on the held-out 20%, while out-of-distribution test sets are evaluated in full.
9. An open question raised is why MM probe directions extracted from the likely dataset produce surprisingly effective causal interventions even though those probes are not accurate at classifying true/false statements, suggesting the direction may capture a causally relevant feature independent of classification performance.
10. Calibrated few-shot prompting is a surprisingly weak baseline for classifying statement truth, underperforming linear probes trained on in-distribution data across multiple LLaMA-2 model sizes and test sets.
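The visualization recipe in item 8 (center the activations, then project onto the top principal components) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: `pca_project`, `truth_dir`, and the random "activations" are hypothetical stand-ins for real residual-stream activations.

```python
import numpy as np

def pca_project(acts, k=2):
    """Center activations by the dataset mean and project onto the top-k PCs."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Synthetic stand-in for activations (assumption: one dominant "truth" axis).
rng = np.random.default_rng(0)
n, d = 200, 16
labels = np.arange(n) % 2                      # balanced true/false labels
truth_dir = np.zeros(d)
truth_dir[0] = 1.0
acts = rng.normal(size=(n, d)) + 6.0 * np.outer(2 * labels - 1, truth_dir) + 3.0

proj = pca_project(acts)                       # shape (n, 2), ready to scatter-plot
pc1 = proj[:, 0]
```

In this toy setup the first principal component separates the two label groups; on real activations, whether it does is exactly what the PCA plots are meant to reveal.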

Peer brief — for seminar discussion

Working with the LLaMA-2 family (7B, 13B, and 70B), Marks and Tegmark investigate whether LLMs encode a geometrically coherent, causally active representation of factual truth. They assemble 12 datasets spanning curated templates (cities, 1,496 rows; sp_en_trans, 354 rows; larger_than/smaller_than, 1,980 rows each), logical variants (negations, conjunctions, disjunctions), uncurated benchmarks from Azaria & Mitchell and Casper et al., and a novel likely dataset of nonfactual text designed to dissociate truth from text probability. The analytic pipeline combines PCA visualizations of residual stream activations at causally-identified hidden states (group (b), localized via patching; e.g., layer 15 over end-of-sentence punctuation in LLaMA-2-13B), cross-dataset probe transfer experiments, and causal interventions that shift activations along probe-identified directions to flip model truth judgments on out-of-distribution inputs. The load-bearing finding is that LLaMA-2-13B and 70B, but not 7B, linearly represent truth in a direction that generalizes across topically and structurally disparate datasets: probes trained on larger_than+smaller_than exceed 95% accuracy on sp_en_trans for both 13B and 70B without any fine-tuning or domain overlap. The paper introduces mass-mean probing (MM), an optimization-free method that computes the difference-in-means vector between true and false activations, optionally correcting for covariance (equivalent to linear discriminant analysis on IID data). MM achieves normalized indirect effects (NIEs) of 0.85/0.97 (false→true / true→false) on sp_en_trans for LLaMA-2-13B with cities+neg_cities training, versus LR NIEs of 0.33/0.52, a large causal advantage despite near-identical classification accuracy. Contrast-consistent search (CCS; Burns et al., 2023) is included as a further comparison; it consistently underperforms MM on causal metrics while matching it on accuracy.
The paper also shows that representations in early layers of LLaMA-2-13B for cities and neg_cities are antipodally aligned before rotating to orthogonality and finally to shared-axis alignment in later layers, consistent with a hypothesis of hierarchical abstraction from surface features to a general truth concept. The central implication is that truth has a real geometric foothold in large transformers, not merely as a classification artifact but as a causally manipulable direction — which has direct relevance to mechanistic interpretability and to schemes for detecting or eliciting honest behavior. A critical reader would push back on the scope restriction: the paper deliberately limits analysis to simple, unambiguous, uncontroversial factual statements and explicitly acknowledges it cannot disambiguate 'true' from 'commonly believed,' 'verifiable,' or 'uncontroversial.' This means the identified direction may be a representation of epistemic certainty or familiarity rather than truth per se, and the entire empirical architecture is designed to sidestep exactly the hard cases — contested facts, deceptive outputs, opinion — where the practical stakes are highest. Whether the linearly-represented direction found on cities and larger_than survives on genuinely contested or multi-step reasoning statements remains an open question, and the restriction to the LLaMA-2 family means generalization to other architectures or training regimes is untested.
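The description of mass-mean probing above (difference of class means, with an optional covariance correction) can be sketched as below. This is a minimal sketch under stated assumptions: inputs are assumed to be already mean-centered, the probe has no bias term, and the covariance is taken over class-centered activations. The function names and synthetic data are hypothetical.

```python
import numpy as np

def mass_mean_probe(acts, labels, iid=False):
    """Return (theta, predict_proba) for a difference-in-means probe.

    theta is the mean of true activations minus the mean of false ones.
    With iid=True, the classifier weights are Sigma^{-1} theta, where Sigma
    is the covariance of the class-centered data (an LDA-style correction).
    """
    theta = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    if iid:
        centered = acts.astype(float).copy()
        for c in (0, 1):
            centered[labels == c] -= centered[labels == c].mean(axis=0)
        sigma = np.cov(centered, rowvar=False)
        weights = np.linalg.pinv(sigma) @ theta
    else:
        weights = theta

    def predict_proba(x):
        return 1.0 / (1.0 + np.exp(-(x @ weights)))

    return theta, predict_proba

# Synthetic, mean-centered stand-in for activations (assumption).
rng = np.random.default_rng(0)
n, d = 400, 8
labels = np.arange(n) % 2
direction = np.zeros(d)
direction[0] = 1.0
acts = rng.normal(size=(n, d)) + 4.0 * np.outer(2 * labels - 1, direction)
acts -= acts.mean(axis=0)

theta, proba = mass_mean_probe(acts, labels)
_, proba_iid = mass_mean_probe(acts, labels, iid=True)
acc = np.mean((proba(acts) > 0.5) == labels)
acc_iid = np.mean((proba_iid(acts) > 0.5) == labels)
```

No optimizer is involved: both variants are closed-form, which is the sense in which the method is optimization-free.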

Methods (1)

Contrast-Consistent Search
Unsupervised probing method from Burns et al. 2023 that identifies directions along which contrast pair representations are far apart
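As a rough illustration of what CCS optimizes, the sketch below evaluates a CCS-style objective on contrast pairs: a consistency term asking that the probabilities for a statement and its negation sum to one, plus a confidence term discouraging the degenerate answer 0.5 for both. The form of the loss follows my reading of Burns et al. (2023) and is not spelled out in this summary; the toy contrast pairs are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss(theta, bias, pos, neg):
    """CCS-style loss over contrast pairs (pos[i], neg[i])."""
    p_pos = sigmoid(pos @ theta + bias)
    p_neg = sigmoid(neg @ theta + bias)
    consistency = (p_pos - (1.0 - p_neg)) ** 2   # p(x+) should equal 1 - p(x-)
    confidence = np.minimum(p_pos, p_neg) ** 2   # penalize p(x+) = p(x-) = 0.5
    return float(np.mean(consistency + confidence))

# Toy contrast pairs lying on opposite sides of a single axis (assumption).
axis = np.array([1.0, 0.0, 0.0])
pos = np.tile(axis, (10, 1))
neg = -pos

loss_uninformative = ccs_loss(np.zeros(3), 0.0, pos, neg)   # 0.25
loss_aligned = ccs_loss(10.0 * axis, 0.0, pos, neg)         # close to 0
```

A direction that pushes the two members of each pair far apart drives the loss toward zero, which matches the one-line description above; no truth labels enter the objective.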

Frameworks (4)

Eliciting Latent Knowledge (ELK)
Christiano et al. (2021) framework motivating the problem of determining whether a model 'believes' a statement; cited as core motivation
Linear World Models in LLMs
Prior work framework studying whether LLMs encode world models as linear structures in their representations
Mass-Mean Probing
Introduced in this paper: an optimization-free probing technique using difference-in-means direction with optional covariance correction
Superposition Hypothesis
Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition

Datasets (12)

cities_cities_conj dataset
Conjunctions of two cities statements with 'and'; 1500 rows
cities_cities_disj dataset
Disjunctions of two cities statements with 'or'; 1500 rows
cities dataset
Curated dataset of statements 'The city of [city] is in [country]'; 1496 rows
common_claim_true_false dataset
Uncurated dataset of various claims from Casper et al. 2023; 4450 rows
companies_true_false dataset
Uncurated dataset of company claims from Azaria & Mitchell 2023; 1200 rows
counterfact_true_false dataset
Uncurated dataset of factual recall claims from Meng et al. 2022; 31960 rows
larger_than dataset
Statements of the form 'x is larger than y' for x,y in 51-99; 1980 rows
likely dataset
Nonfactual text where final token is either most or 100th most likely per LLaMA-13B; used to distinguish truth from text probability
neg_cities dataset
Negations of cities statements using 'not'; 1496 rows
neg_sp_en_trans dataset
Negations of sp_en_trans statements; 354 rows
smaller_than dataset
Statements of the form 'x is smaller than y'; 1980 rows; antipodal to larger_than in 13B but aligns in 70B
sp_en_trans dataset
Curated dataset of Spanish-to-English translation statements; 354 rows
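To make the templated datasets concrete, here is a small sketch that builds larger_than-style rows. Two things are my assumptions rather than facts from this summary: multiples of ten are skipped (one reading that reproduces the stated 1980 rows, since 45 × 44 = 1980), and numbers are written as digits, whereas the mention of the token 'eighty' elsewhere suggests the real dataset spells them out.

```python
from itertools import permutations

# Assumption: x, y range over 51..99 with multiples of ten excluded (45 values).
numbers = [n for n in range(51, 100) if n % 10 != 0]

# One row per ordered pair with x != y: (statement, truth label).
rows = [(f"{x} is larger than {y}.", x > y) for x, y in permutations(numbers, 2)]

n_rows = len(rows)                               # 45 * 44 ordered pairs
n_true = sum(label for _, label in rows)         # exactly half are true
```

Under the same assumptions, swapping "larger" for "smaller" and flipping the comparison would give a smaller_than-style counterpart with the opposite labels.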

Findings (22)

PCA visualizations of LLaMA-2-13B and 70B representations of curated datasets show clear linear structure, with true statements separating from false ones in the top two principal components
Primary visual evidence for linear truth representations in large LLMs
In LLaMA-2-13B, salient linear structure in the top PCs rapidly emerges in early-middle layers, with this emergence occurring later for conjunctive statements than simple statements
Layer-wise emergence pattern supporting hierarchical development hypothesis
Patching group (b) hidden states (over clause-ending punctuation, early-middle layers) in LLaMA-2-13B produces the strongest causal effect on TRUE/FALSE output predictions
Localizes truth representations to specific hidden states, motivating the rest of the analysis
In LLaMA-2-7B, PCA of larger_than+smaller_than shows statements clustering by surface-level characteristics (e.g., presence of token 'eighty') rather than truth value
Shows absence of abstract truth representations in smallest model, supporting scale-dependent emergence claim
MM probes trained on larger_than+smaller_than achieve lower NIE than those trained on cities+neg_cities despite higher classification accuracy on sp_en_trans
Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
In LLaMA-2-13B, cities and neg_cities show antipodal alignment in early layers, rotate to orthogonal in middle layers, then eventually align in later layers
Layer-by-layer evolution of truth direction alignment, supporting hierarchical abstraction hypothesis
For Gaussian data with homoscedastic class-conditional distributions, IID mass-mean probing coincides with logistic regression (Theorem F.1)
Formal result establishing the theoretical connection between mass-mean probing and LR
In LLaMA-2-13B, larger_than and smaller_than separate along antipodal directions in PCA; in LLaMA-2-70B they align along a common direction
Scale-dependent alignment result demonstrating how more abstract truth representations emerge with scale
LLaMA-2-70B displays summarization behavior over punctuation tokens in a context-dependent way: present for cities but not for sp_en_trans
Contrasts with 7B and 13B which show consistent summarization behavior; may complicate localization at 70B scale
MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
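The NIE figures quoted in these findings come from shifting hidden states along a probe direction and measuring how far the model's TRUE/FALSE preference moves. The sketch below assumes the normalization NIE = (PD* − PD_source) / (PD_target − PD_source), where PD is the mean of P(TRUE) − P(FALSE); this summary does not state the formula, so treat it as an assumption. The numbers are invented purely to exercise the arithmetic.

```python
import numpy as np

def intervene(hidden, direction, sign=1.0):
    """Shift hidden states along a probe direction (+ for false→true, - for true→false)."""
    return hidden + sign * direction

def normalized_indirect_effect(pd_intervened, pd_source, pd_target):
    """0 means the intervention changed nothing; 1 means it fully closed the gap."""
    return (pd_intervened - pd_source) / (pd_target - pd_source)

# Hypothetical probability differences, not results from the paper.
pd_false = -0.5        # mean PD on false statements, no intervention
pd_true = 0.5          # mean PD on true statements, no intervention
pd_after = 0.25        # mean PD on false statements after adding the direction

nie = normalized_indirect_effect(pd_after, pd_false, pd_true)   # 0.75

# The intervention itself is a plain vector addition on the chosen hidden states.
hidden = np.zeros((3, 4))
shifted = intervene(hidden, np.array([1.0, 0.0, 0.0, 0.0]))
```

Because the metric depends on the model's output rather than on probe accuracy, two probes with the same classification accuracy can have very different NIEs.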

Claims (12)

In intermediate regimes of scale or layer depth, LLMs may linearly represent features at intermediate levels of abstraction such as 'accurate factual recall' or 'close association' rather than abstract truth
Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations
Logistic regression fails to identify the true feature direction when a confounding feature is non-orthogonal to the truth direction, converging instead to the maximum margin separator
Motivates the introduction of mass-mean probing as an alternative to LR
As LLMs scale, they develop increasingly general abstractions, with large models linearly representing abstract concepts like truth that capture shared properties of diverse inputs
Interpretive claim connecting scale to abstraction level in LLM representations
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on likely performing poorly on anti-correlated datasets
Establishes that the observed linear structure is not merely a representation of text probability
The difference-in-means direction is the unique nullity-1 projection kernel that eliminates all linearly-recoverable binary classification information from a dataset
Formal consequence of Belrose et al. (2023) Theorem G.1 connecting mass-mean probing to optimal linear concept erasure
LLMs sometimes know statements are false but generate them anyway, motivating the need for techniques that inspect internal model state rather than outputs alone
Motivating claim supported by the CAPTCHA example and Perez et al. (2022) findings
LLMs hierarchically develop understanding of their input data, progressing from surface-level features in early layers to more abstract concepts in later layers
Interpretation of the layer-by-layer PCA visualizations showing linear structure emerging in early-middle layers
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
Antipodal alignment between related datasets (e.g., larger_than and smaller_than) in smaller models resolves to common-direction alignment in larger models
Scale-dependent structural finding from PCA visualizations in §4
In early layers, LLaMA-2-13B represents a 'close association' feature that correlates with truth on cities but anti-correlates on neg_cities
Hypothesized intermediate feature explaining antipodal alignment between cities and neg_cities in early-middle layers
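The claim above about logistic regression and non-orthogonal confounds can be illustrated with a two-dimensional toy. Each point is ±`theta_t` (the label) plus a label-independent amount of a confounding feature `theta_f`. The maximum-margin normal is computed geometrically here rather than by fitting LR, and all names and values are hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

theta_t = np.array([1.0, 0.0])                   # "truth" direction
phi = np.deg2rad(30.0)
theta_f = np.array([np.cos(phi), np.sin(phi)])   # confound, not orthogonal to theta_t

# The confound varies identically in both classes, so it carries no label information.
f = np.linspace(-3.0, 3.0, 61)
pos = theta_t + np.outer(f, theta_f)
neg = -theta_t + np.outer(f, theta_f)

# Mass-mean direction: the confound averages out, leaving 2 * theta_t.
mm_dir = pos.mean(axis=0) - neg.mean(axis=0)

# The two classes are parallel segments along theta_f, so the max-margin
# normal is perpendicular to theta_f.
max_margin_dir = np.array([-theta_f[1], theta_f[0]])
if max_margin_dir @ theta_t < 0:
    max_margin_dir = -max_margin_dir

cos_mm = cosine(mm_dir, theta_t)                 # approximately 1.0
cos_margin = cosine(max_margin_dir, theta_t)     # sin(30 deg) = 0.5
```

Both directions separate the toy classes perfectly, yet only the difference-in-means direction coincides with `theta_t`, which is the gap between accuracy and direction recovery that motivates MM.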

Hypotheses (3)

We hypothesize that the layer-wise emergence of linear structure is due to LLMs hierarchically developing understanding of their input data, progressing from surface level features to more abstract concepts
Stated explicitly in App. C to explain why linear structure emerges later for conjunctive statements
We hypothesize that group (b) hidden states store a representation of the statement's truth
Motivating hypothesis driving the remainder of the paper's analysis after patching localization

Questions (7)

Why were interventions with mass-mean probe directions extracted from the likely dataset so effective, despite these probes not being accurate at classifying true/false statements?
Open question raised in §7.1 about an unexplained anomalous result
Why did mass-mean probing with cities+neg_cities perform poorly for the 70B model, despite mass-mean probing with larger_than+smaller_than performing well?
Unexplained result pointing to asymmetry in how training on opposites affects truth probes at 70B scale
Can truth representations be disambiguated from closely related features such as 'commonly believed' or 'verifiable' using simple factual statements?
Acknowledged limitation: simple uncontroversial statements cannot distinguish truth from related epistemic features
Do LLMs have a unified representation of truth that spans structurally and topically diverse data?
Central research question driving dataset design and experimental approach
Given a language model M and a statement s, does M believe s to be true?
The core motivating question of the paper, framed by Christiano et al. (2021)

Original abstract

Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we use high-quality datasets of simple true/false statements to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements. We also show that simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs.

Cited by (5)

Testing the Limits of Truth Directions in LLMs
Linear truth directions in LLMs are reliable primarily for simple factual retrieval and break down as soon as truth assessment requires tracking intermediate results—a finding that sharply constrains
From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of or
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded to 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-mean
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as