The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets (arXiv:2310.06824)
TL;DR
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations. The paper supports this claim with PCA visualizations, cross-dataset probe transfer, and causal intervention experiments on LLaMA-2-7B, 13B, and 70B. PCA of residual stream activations at the most-downstream causally implicated hidden state (group (b), e.g., layer 15 of LLaMA-2-13B over end-of-sentence punctuation) shows clear linear separation of true and false statements across structurally and topically diverse datasets, including cities (1,496 rows), sp_en_trans (354 rows), larger_than/smaller_than (1,980 rows each), and uncurated datasets from Azaria & Mitchell and Casper et al. The paper introduces mass-mean probing (MM), an optimization-free method that takes the difference-in-means direction between true and false activations and optionally applies a covariance correction. MM outperforms logistic regression (LR) and contrast-consistent search (CCS) on causal intervention metrics in 7 of 8 experimental conditions despite comparable classification accuracy across methods, with normalized indirect effects as high as 0.97 (true→false) on sp_en_trans for LLaMA-2-13B with cities+neg_cities training data. Probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans for LLaMA-2-13B and 70B; this cross-topic generalization fails for LLaMA-2-7B, whose representations cluster by surface-level token features instead. The paper argues that truth has a geometrically coherent, causally active linear representation in large transformers, and that interventions along this direction can flip a model's expressed truth judgments on out-of-distribution inputs.
What to take away
- 1. LLaMA-2-70B and 13B, but not 7B, show clear linear separation of true and false statements in the top two PCA dimensions of residual stream activations at group (b) hidden states (e.g., layer 15 over end-of-sentence punctuation in LLaMA-2-13B).
- 2. Mass-mean probing (MM), which uses the difference-in-means direction with an optional covariance correction, outperforms logistic regression (LR) and contrast-consistent search (CCS) on normalized indirect effect (NIE) in 7 of 8 causal intervention conditions, despite similar classification accuracy across methods.
- 3. The MM probe trained on cities+neg_cities achieves NIEs of 0.85 (false→true) and 0.97 (true→false) on sp_en_trans for LLaMA-2-13B, compared to LR NIEs of 0.33 and 0.52 respectively.
- 4. Probes trained on larger_than+smaller_than generalize to >95% accuracy on sp_en_trans for both LLaMA-2-13B and LLaMA-2-70B, demonstrating cross-topic transfer of a linearly-encoded truth direction.
- 5. Patching experiments on LLaMA-2-13B identify three groups of causally-implicated hidden states: group (a) encoding entity representations, group (b) encoding statement-level truth above end-of-sentence punctuation, and group (c) directly driving the TRUE/FALSE output logits.
- 6. Probes trained on the likely dataset — nonfactual text where the final token is the most or 100th most probable completion per LLaMA-13B — perform worse than chance on datasets with anti-correlations between truth and text probability (e.g., neg_cities, where r = −0.63), ruling out probability-of-text as the underlying represented feature.
- 7. For LLaMA-2-13B, cities and neg_cities representations transition from antipodal alignment in early layers, through orthogonal separation at intermediate layers, to shared-axis alignment in later layers, suggesting a hierarchical emergence from surface features (e.g., 'close association') to abstract truth.
- 8. A replicable methodology: activations are extracted at the most-downstream group (b) hidden state without a few-shot prompt, centered by subtracting the dataset mean, and projected via PCA. Probes are trained on 80% of the training dataset with the remaining 20% held out for in-distribution accuracy, while out-of-distribution test sets are evaluated in full.
- 9. An open question is why MM probe directions extracted from the likely dataset produce surprisingly effective causal interventions despite those probes classifying true/false statements at near-chance accuracy, suggesting the direction may capture a causally relevant feature independent of classification performance.
- 10. Calibrated few-shot prompting is a surprisingly weak baseline for classifying statement truth, underperforming linear probes trained on in-distribution data across multiple LLaMA-2 model sizes and test sets.
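The visualization step described in takeaway 8 (center by the dataset mean, then project with PCA) can be sketched as follows. This is a minimal illustration, not the authors' code: the activations are synthetic, the planted `truth_dir` axis and the helper name `pca_project` are assumptions made for the example, and no model is loaded.

```python
import numpy as np

def pca_project(acts: np.ndarray, k: int = 2) -> np.ndarray:
    """Center activations by the dataset mean and project onto the top-k PCs."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # SVD of the centered data: rows of vt are principal directions,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Synthetic stand-in for hidden-state activations (hypothetical data).
rng = np.random.default_rng(0)
n, d = 400, 64
labels = rng.integers(0, 2, size=n)            # 1 = true, 0 = false
truth_dir = np.zeros(d)
truth_dir[0] = 1.0                             # planted "truth" axis (assumption)
acts = rng.normal(size=(n, d)) + 6.0 * np.outer(2 * labels - 1, truth_dir)

proj = pca_project(acts)                       # shape (n, 2)
# The sign of a principal component is arbitrary, so compare magnitudes only.
gap = proj[labels == 1, 0].mean() - proj[labels == 0, 0].mean()
print(f"class-mean gap along PC1: {abs(gap):.2f}")
```

With real activations, the same function would be applied to the matrix of group (b) hidden states, one row per statement, and the two columns of `proj` scatter-plotted with points colored by label.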
Peer brief — for seminar discussion
Working with the LLaMA-2 family (7B, 13B, and 70B), Marks and Tegmark investigate whether LLMs encode a geometrically coherent, causally active representation of factual truth. They assemble 12 datasets spanning curated templates (cities, 1,496 rows; sp_en_trans, 354 rows; larger_than/smaller_than, 1,980 rows each), logical variants (negations, conjunctions, disjunctions), uncurated benchmarks from Azaria & Mitchell and Casper et al., and a novel likely dataset of nonfactual text designed to dissociate truth from text probability. The analytic pipeline combines PCA visualizations of residual stream activations at causally identified hidden states (group (b), localized via patching; e.g., layer 15 over end-of-sentence punctuation in LLaMA-2-13B), cross-dataset probe transfer experiments, and causal interventions that shift activations along probe-identified directions to flip model truth judgments on out-of-distribution inputs. The load-bearing finding is that LLaMA-2-13B and 70B, but not 7B, linearly represent truth in a direction that generalizes across topically and structurally disparate datasets: probes trained on larger_than+smaller_than exceed 95% accuracy on sp_en_trans for both 13B and 70B without any fine-tuning or domain overlap. The paper introduces mass-mean probing (MM), an optimization-free method that computes the difference-in-means vector between true and false activations, optionally correcting for covariance (equivalent to linear discriminant analysis on IID data). MM achieves normalized indirect effects (NIEs) of 0.85/0.97 (false→true / true→false) on sp_en_trans for LLaMA-2-13B with cities+neg_cities training, versus LR NIEs of 0.33/0.52, a large causal advantage despite near-identical classification accuracy. Contrast-consistent search (CCS; Burns et al., 2023) is included as a further baseline; it matches MM on accuracy but underperforms it on causal metrics in most conditions.
The paper also shows that representations in early layers of LLaMA-2-13B for cities and neg_cities are antipodally aligned before rotating to orthogonality and finally to shared-axis alignment in later layers, consistent with a hypothesis of hierarchical abstraction from surface features to a general truth concept. The central implication is that truth has a real geometric foothold in large transformers, not merely as a classification artifact but as a causally manipulable direction — which has direct relevance to mechanistic interpretability and to schemes for detecting or eliciting honest behavior. A critical reader would push back on the scope restriction: the paper deliberately limits analysis to simple, unambiguous, uncontroversial factual statements and explicitly acknowledges it cannot disambiguate 'true' from 'commonly believed,' 'verifiable,' or 'uncontroversial.' This means the identified direction may be a representation of epistemic certainty or familiarity rather than truth per se, and the entire empirical architecture is designed to sidestep exactly the hard cases — contested facts, deceptive outputs, opinion — where the practical stakes are highest. Whether the linearly-represented direction found on cities and larger_than survives on genuinely contested or multi-step reasoning statements remains an open question, and the restriction to the LLaMA-2 family means generalization to other architectures or training regimes is untested.
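To make the NIE figures quoted in this brief easier to read, here is a hedged sketch of how a normalized indirect effect could be computed. It assumes the metric compares the mean probability difference P(TRUE) - P(FALSE) before and after the intervention, scaled so that 0 means no change and 1 means the intervened statements are treated like the opposite class; the summary above does not spell out the formula, so this normalization is our assumption. All probabilities below are invented for illustration.

```python
import numpy as np

def prob_diff(p_true_tok, p_false_tok) -> float:
    """Mean of P(TRUE) - P(FALSE) over a batch of statements."""
    return float(np.mean(np.asarray(p_true_tok) - np.asarray(p_false_tok)))

def normalized_indirect_effect(pd_intervened: float,
                               pd_source: float,
                               pd_target: float) -> float:
    """0 = intervention changed nothing; 1 = outputs match the target class."""
    return (pd_intervened - pd_source) / (pd_target - pd_source)

# Made-up output probabilities (hypothetical, not results from the paper).
pd_false = prob_diff([0.2, 0.3], [0.8, 0.7])              # false statements, no intervention
pd_true = prob_diff([0.8, 0.7], [0.2, 0.3])               # true statements, no intervention
pd_false_shifted = prob_diff([0.6, 0.65], [0.4, 0.35])    # false statements after the shift

nie_false_to_true = normalized_indirect_effect(pd_false_shifted, pd_false, pd_true)
print(f"false->true NIE: {nie_false_to_true:.2f}")
```

The true→false direction would use the same function with the roles of the two classes swapped.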
Methods (1)
- Contrast-Consistent Search: Unsupervised probing method from Burns et al. 2023 that identifies directions along which contrast pair representations are far apart
Frameworks (4)
- Eliciting Latent Knowledge (ELK): Christiano et al. (2021) framework motivating the problem of determining whether a model 'believes' a statement; cited as core motivation
- Linear World Models in LLMs: Prior work framework studying whether LLMs encode world models as linear structures in their representations
- Mass-Mean Probing: Introduced in this paper: an optimization-free probing technique using difference-in-means direction with optional covariance correction
- Superposition Hypothesis: Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition
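As a concrete illustration of mass-mean probing, the sketch below computes the difference-in-means direction and an optional covariance-corrected variant on synthetic data. Several details are assumptions of ours rather than the paper's specification: the helper names, the use of the class-centered covariance for the correction, and thresholding at the midpoint of the two class means. The 80/20 split mirrors the methodology described above.

```python
import numpy as np

def mass_mean_direction(acts, labels):
    """theta = mean(true activations) - mean(false activations)."""
    return acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)

def class_centered_cov(acts, labels):
    """Covariance of activations after subtracting each class's own mean."""
    centered = acts.astype(float)              # astype returns a copy
    for c in (0, 1):
        centered[labels == c] -= acts[labels == c].mean(axis=0)
    return centered.T @ centered / len(acts)

def mm_predict(acts, theta, midpoint, cov=None):
    """Predict 'true' when the score exceeds that of the class-mean midpoint."""
    # Without cov: score = theta^T x. With cov: score = theta^T cov^{-1} x.
    w = theta if cov is None else np.linalg.solve(cov, theta)
    return ((acts - midpoint) @ w > 0).astype(int)

# Synthetic activations with a correlated nuisance dimension (hypothetical data).
rng = np.random.default_rng(0)
n, d = 1000, 8
labels = rng.integers(0, 2, size=n)
noise = rng.normal(size=(n, d))
noise[:, 1] += 0.8 * noise[:, 0]
mu = np.zeros(d)
mu[0] = 2.0
acts = noise + np.outer(2 * labels - 1, mu)

train, test = slice(0, 800), slice(800, None)  # 80/20 split
x_tr, y_tr = acts[train], labels[train]
theta = mass_mean_direction(x_tr, y_tr)
mid = 0.5 * (x_tr[y_tr == 1].mean(axis=0) + x_tr[y_tr == 0].mean(axis=0))
cov = class_centered_cov(x_tr, y_tr)

acc_mm = (mm_predict(acts[test], theta, mid) == labels[test]).mean()
acc_iid = (mm_predict(acts[test], theta, mid, cov) == labels[test]).mean()
print(f"MM accuracy: {acc_mm:.3f}, covariance-corrected: {acc_iid:.3f}")
```

No optimizer is involved: the probe is two means and, optionally, one linear solve, which is what "optimization-free" refers to.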
Datasets (12)
- cities_cities_conj dataset: Conjunctions of two cities statements with 'and'; 1500 rows
- cities_cities_disj dataset: Disjunctions of two cities statements with 'or'; 1500 rows
- cities dataset: Curated dataset of statements 'The city of [city] is in [country]'; 1496 rows
- common_claim_true_false dataset: Uncurated dataset of various claims from Casper et al. 2023; 4450 rows
- companies_true_false dataset: Uncurated dataset of company claims from Azaria & Mitchell 2023; 1200 rows
- counterfact_true_false dataset: Uncurated dataset of factual recall claims from Meng et al. 2022; 31960 rows
- larger_than dataset: Statements of the form 'x is larger than y' for x,y in 51-99; 1980 rows
- likely dataset: Nonfactual text where final token is either most or 100th most likely per LLaMA-13B; used to distinguish truth from text probability
- neg_cities dataset: Negations of cities statements using 'not'; 1496 rows
- neg_sp_en_trans dataset: Negations of sp_en_trans statements; 354 rows
- smaller_than dataset: Statements of the form 'x is smaller than y'; 1980 rows; antipodal to larger_than in 13B but aligns in 70B
- sp_en_trans dataset: Curated dataset of Spanish-to-English translation statements; 354 rows
Findings (22)
- PCA visualizations of LLaMA-2-13B and 70B representations of curated datasets show clear linear structure, with true statements separating from false ones in the top two principal components
Primary visual evidence for linear truth representations in large LLMs
- In LLaMA-2-13B, salient linear structure in the top PCs rapidly emerges in early-middle layers, with this emergence occurring later for conjunctive statements than simple statements
Layer-wise emergence pattern supporting hierarchical development hypothesis
- Patching group (b) hidden states (over clause-ending punctuation, early-middle layers) in LLaMA-2-13B produces the strongest causal effect on TRUE/FALSE output predictions
Localizes truth representations to specific hidden states, motivating the rest of the analysis
- In LLaMA-2-7B, PCA of larger_than+smaller_than shows statements clustering by surface-level characteristics (e.g., presence of token 'eighty') rather than truth value
Shows absence of abstract truth representations in smallest model, supporting scale-dependent emergence claim
- MM probes trained on larger_than+smaller_than achieve lower NIE than those trained on cities+neg_cities despite higher classification accuracy on sp_en_trans
Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
- In LLaMA-2-13B, cities and neg_cities show antipodal alignment in early layers, rotate to orthogonal in middle layers, then eventually align in later layers
Layer-by-layer evolution of truth direction alignment, supporting hierarchical abstraction hypothesis
- For Gaussian data with homoscedastic class-conditional distributions, IID mass-mean probing coincides with logistic regression (Theorem F.1)
Formal result establishing the theoretical connection between mass-mean probing and LR
- In LLaMA-2-13B, larger_than and smaller_than separate along antipodal directions in PCA; in LLaMA-2-70B they align along a common direction
Scale-dependent alignment result demonstrating how more abstract truth representations emerge with scale
- LLaMA-2-70B displays summarization behavior over punctuation tokens in a context-dependent way: present for cities but not for sp_en_trans
Contrasts with 7B and 13B which show consistent summarization behavior; may complicate localization at 70B scale
- MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
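The Theorem F.1 finding above (covariance-corrected mass-mean probing coinciding with logistic regression for homoscedastic Gaussian data) follows the standard linear discriminant analysis argument, sketched here. The notation and the equal-prior assumption are ours, and this is not a reproduction of the paper's proof.

```latex
% Assumptions (ours): x | y=+ ~ N(\mu_+, \Sigma), x | y=- ~ N(\mu_-, \Sigma),
% with a shared covariance \Sigma and equal class priors.
\begin{align*}
\log\frac{p(x \mid +)}{p(x \mid -)}
  &= -\tfrac12 (x-\mu_+)^\top \Sigma^{-1} (x-\mu_+)
     + \tfrac12 (x-\mu_-)^\top \Sigma^{-1} (x-\mu_-) \\
  &= (\mu_+ - \mu_-)^\top \Sigma^{-1} x
     - \tfrac12\left(\mu_+^\top \Sigma^{-1}\mu_+ - \mu_-^\top \Sigma^{-1}\mu_-\right) \\
  &= \theta^\top \Sigma^{-1} x + b,
  \qquad \theta = \mu_+ - \mu_- .
\end{align*}
% Hence P(+ \mid x) = \sigma(\theta^\top \Sigma^{-1} x + b): the Bayes-optimal
% posterior is a logistic model whose weight vector is \Sigma^{-1}\theta,
% the covariance-corrected difference-in-means direction.
```

The quadratic terms in x cancel only because both classes share the same covariance, which is why the homoscedasticity condition appears in the statement of the finding.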
Claims (12)
- In intermediate regimes of scale or layer depth, LLMs may linearly represent features at intermediate levels of abstraction such as 'accurate factual recall' or 'close association' rather than abstract truth
Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations
- Logistic regression fails to identify the true feature direction when a confounding feature is non-orthogonal to the truth direction, converging instead to the maximum margin separator
Motivates the introduction of mass-mean probing as an alternative to LR
- As LLMs scale, they develop increasingly general abstractions, with large models linearly representing abstract concepts like truth that capture shared properties of diverse inputs
Interpretive claim connecting scale to abstraction level in LLM representations
- LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on likely performing poorly on anti-correlated datasets
Establishes that the observed linear structure is not merely a representation of text probability
- The difference-in-means direction is the unique nullity-1 projection kernel that eliminates all linearly-recoverable binary classification information from a dataset
Formal consequence of Belrose et al. (2023) Theorem G.1 connecting mass-mean probing to optimal linear concept erasure
- LLMs sometimes know statements are false but generate them anyway, motivating the need for techniques that inspect internal model state rather than outputs alone
Motivating claim supported by the CAPTCHA example and Perez et al. (2022) findings
- LLMs hierarchically develop understanding of their input data, progressing from surface-level features in early layers to more abstract concepts in later layers
Interpretation of the layer-by-layer PCA visualizations showing linear structure emerging in early-middle layers
- Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
- Antipodal alignment between related datasets (e.g., larger_than and smaller_than) in smaller models resolves to common-direction alignment in larger models
Scale-dependent structural finding from PCA visualizations in §4
- In early layers, LLaMA-2-13B represents a 'close association' feature that correlates with truth on cities but anti-correlates on neg_cities
Hypothesized intermediate feature explaining antipodal alignment between cities and neg_cities in early-middle layers
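The concept-erasure claim in the list above can be checked numerically in a small way: projecting activations onto the orthogonal complement of the difference-in-means direction makes the two class means coincide, so no linear readout can separate the classes by their means. The sketch uses synthetic data and an orthogonal projection, which is one choice of projection whose kernel is that direction; the function name `project_out` is hypothetical.

```python
import numpy as np

def project_out(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each activation along `direction`."""
    u = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ u, u)

# Synthetic activations with a random class offset (hypothetical data).
rng = np.random.default_rng(0)
n, d = 500, 16
labels = rng.integers(0, 2, size=n)
offset = rng.normal(size=d)
acts = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, offset)

theta = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
erased = project_out(acts, theta)

# After erasure the class means agree up to floating-point error.
residual = erased[labels == 1].mean(axis=0) - erased[labels == 0].mean(axis=0)
print(f"max class-mean difference after erasure: {np.abs(residual).max():.2e}")
```

Projecting out any other single direction would generally leave a nonzero difference between the class means, which is the intuition behind the uniqueness part of the claim.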
Hypotheses (3)
- We hypothesize that the layer-wise emergence of linear structure is due to LLMs hierarchically developing understanding of their input data, progressing from surface level features to more abstract concepts
Stated explicitly in App. C to explain why linear structure emerges later for conjunctive statements
- We hypothesize that group (b) hidden states store a representation of the statement's truth
Motivating hypothesis driving the remainder of the paper's analysis after patching localization
Questions (7)
- Why were interventions with mass-mean probe directions extracted from the likely dataset so effective, despite these probes not being accurate at classifying true/false statements?
Open question raised in §7.1 about an unexplained anomalous result
- Why did mass-mean probing with cities+neg_cities perform poorly for the 70B model, despite mass-mean probing with larger_than+smaller_than performing well?
Unexplained result pointing to asymmetry in how training on opposites affects truth probes at 70B scale
- Can truth representations be disambiguated from closely related features such as 'commonly believed' or 'verifiable' using simple factual statements?
Acknowledged limitation: simple uncontroversial statements cannot distinguish truth from related epistemic features
- Can we disambiguate truth from closely related features such as 'commonly believed' or 'verifiable'?
Limitation noted in §7.1: scope restricted to simple statements prevents disambiguation
- Do LLMs have a unified representation of truth that spans structurally and topically diverse data?
Central research question driving dataset design and experimental approach
- Given a language model M and a statement s, does M believe s to be true?
The core motivating question of the paper, framed by Christiano et al. (2021)
Original abstract
Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we use high-quality datasets of simple true/false statements to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements. We also show that simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs.
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- Testing the Limits of Truth Directions in LLMs (in corpus; 2026; ≈ 89%)
- Can LLMs Lie? Investigation beyond Hallucination (Mihir Prabhudesai, Mengning Wu, Shantanu Jaiswal, Deepak Pathak Haoran Huan; 2025; ≈ 86%)
- The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems (Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, Brad Kenstler, Mick Yang, Isabelle Barrass, Alice Gatti, Xuwang Yin, Eduardo Trevino, Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks Richard Ren; 2026; ≈ 85%)
- StreetMath: Study of LLMs' Approximation Behaviors (Somshubhra Roy, Maisha Thasin, Danyang Zhang, and Blessing Effiong Chiung-Yi Tseng; 2025; ≈ 84%)
- Probing for Knowledge Attribution in Large Language Models (Alexander Boer, Dennis Ulmer Ivo Brink; 2026; ≈ 84%)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs (in corpus; 2025; ≈ 84%)
- Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness (Shai Gretz, Yoav Katz, Yonatan Belinkov, Liat Ein-Dor Tomer Ashuach; 2026; ≈ 84%)
- The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models (Zekun Wu, Kleyton Da Costa, Adriano Koshiyama Seonglae Cho; 2026; ≈ 84%)
- Mechanistic Interpretability in the Presence of Architectural Obfuscation (Thomas Barton Marcos Florencio; 2025; ≈ 84%)
- Discovering and Reasoning of Causality in the Hidden World with Large Language Models (Yongqiang Chen, Tongliang Liu, Mingming Gong, James Cheng, Bo Han, Kun Zhang Chenxi Liu; 2025; ≈ 84%)
- LLM Assertiveness can be Mechanistically Decomposed into Emotional and Logical Components (Arush Tagade Hikaru Tsujimura; 2025; ≈ 83%)
- Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs (Manas Mittal, Anmol Goel, Ponnurangam Kumaraguru, Vamshi Krishna Bonagiri Krishak Aneja; 2026; ≈ 83%)
- Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories (Xianfeng Jiao, Yinghao Zhu, Zhongzhi Chen, Yifan He, Xu Chu, Junyi Gao, Yasha Wang, Liantao Ma Tianlong Wang; 2025; ≈ 83%)
- When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models (in corpus; 2025; ≈ 83%)
- Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (Jennifer Hu, Ishita Dasgupta, Roma Patel, Thomas Serre, Ellie Pavlick Michael A. Lepori; 2026; ≈ 83%)
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? (in corpus; 2025; ≈ 81%)
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders (in corpus; 2026; ≈ 81%)
- Model Alignment Search (in corpus; 2025; ≈ 81%)
- The Platonic Representation Hypothesis (in corpus; 2024; ≈ 81%)
- Neural natural language inference models partially embed theories of lexical entailment and negation (cited; 2020; ≈ 78%)
Cited by (5)
- Testing the Limits of Truth Directions in LLMs
Linear truth directions in LLMs are reliable primarily for simple factual retrieval and break down as soon as truth assessment requires tracking intermediate results—a finding that sharply constrains
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of or
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded to 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-mean
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as