paper
active
2023
17
paper:arxiv-2310-06824

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

TL;DR

At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations. The claim is supported by PCA visualizations, cross-dataset probe transfer, and causal intervention experiments on LLaMA-2-7B, 13B, and 70B. PCA of residual stream activations at the most-downstream causally implicated hidden state (group (b), e.g., layer 15 of LLaMA-2-13B over end-of-sentence punctuation) reveals clear linear separation of true and false statements across structurally and topically diverse datasets, including cities (1,496 rows), sp_en_trans (354 rows), larger_than/smaller_than (1,980 rows each), and uncurated datasets from Azaria & Mitchell and Casper et al. The paper introduces mass-mean probing (MM), an optimization-free method that computes the difference-in-means direction between true and false activations and optionally applies a covariance correction. MM outperforms logistic regression (LR) and contrast-consistent search (CCS) on causal intervention metrics in 7 of 8 experimental conditions, despite comparable classification accuracy across methods; its normalized indirect effects reach 0.97 (true→false) on sp_en_trans for LLaMA-2-13B with cities+neg_cities training data. Probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans for LLaMA-2-13B and 70B, a cross-topic generalization that fails for LLaMA-2-7B, where representations cluster by surface-level token features instead. The paper argues that truth has a geometrically coherent, causally active linear representation in large transformers, and that interventions targeting this direction can reliably flip a model's expressed truth judgments on out-of-distribution inputs.
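
As a rough illustration of the mass-mean idea summarized above, the sketch below computes a difference-in-means direction and an optional covariance-corrected classifier with NumPy. It is not the authors' code: the function name `mass_mean_probe`, the midpoint threshold, and the synthetic activations are all assumptions made for this example.

```python
import numpy as np

def mass_mean_probe(acts, labels, use_covariance=True):
    """Return (direction, classify) for a difference-in-means probe.

    acts:   (n, d) array of hidden-state activations (hypothetical input)
    labels: (n,) array with 1 for true statements and 0 for false ones
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels).astype(bool)
    mu_true = acts[labels].mean(axis=0)
    mu_false = acts[~labels].mean(axis=0)
    theta = mu_true - mu_false  # difference-in-means direction

    if use_covariance:
        # Pool the within-class scatter and tilt theta by its (pseudo-)inverse.
        centered = np.vstack([acts[labels] - mu_true, acts[~labels] - mu_false])
        cov = centered.T @ centered / len(acts)
        weights = np.linalg.pinv(cov) @ theta
    else:
        weights = theta

    # Assumption: place the decision threshold halfway between the class means.
    midpoint = (mu_true + mu_false) / 2

    def classify(x):
        return ((np.asarray(x, dtype=float) - midpoint) @ weights > 0).astype(int)

    return theta, classify

# Synthetic stand-in for activations: one axis carries the "truth" signal.
rng = np.random.default_rng(0)
truth_axis = np.zeros(8)
truth_axis[0] = 1.0
labels = rng.integers(0, 2, size=400)
acts = rng.normal(size=(400, 8)) + 4.0 * labels[:, None] * truth_axis

theta, classify = mass_mean_probe(acts, labels)
accuracy = (classify(acts) == labels).mean()
```

No optimizer is involved: the direction comes from two class means, which is what makes the method optimization-free. The uncorrected `theta` is the vector one would add or subtract in an intervention, while the covariance-corrected weights are only used for classification in this sketch.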

What to take away

  1. LLaMA-2-70B and 13B, but not 7B, show clear linear separation of true and false statements in the top two PCA dimensions of residual stream activations at group (b) hidden states (e.g., layer 15 over end-of-sentence punctuation in LLaMA-2-13B).
  2. Mass-mean probing (MM), which uses the difference-in-means direction with an optional covariance correction, outperforms logistic regression (LR) and contrast-consistent search (CCS) on normalized indirect effect (NIE) in 7 of 8 causal intervention conditions, despite similar classification accuracy across methods.
  3. The MM probe trained on cities+neg_cities achieves NIEs of 0.85 (false→true) and 0.97 (true→false) on sp_en_trans for LLaMA-2-13B, compared to LR NIEs of 0.33 and 0.52 respectively.
  4. Probes trained on larger_than+smaller_than generalize to >95% accuracy on sp_en_trans for both LLaMA-2-13B and LLaMA-2-70B, demonstrating cross-topic transfer of a linearly encoded truth direction.
  5. Patching experiments on LLaMA-2-13B identify three groups of causally implicated hidden states: group (a) encoding entity representations, group (b) encoding statement-level truth above end-of-sentence punctuation, and group (c) directly driving the TRUE/FALSE output logits.
  6. Probes trained on the likely dataset (nonfactual text where the final token is the most or 100th most probable completion per LLaMA-13B) perform worse than chance on datasets with anti-correlations between truth and text probability (e.g., neg_cities, where r = −0.63), ruling out probability-of-text as the underlying represented feature.
  7. For LLaMA-2-13B, cities and neg_cities representations transition from antipodal alignment in early layers, through orthogonal separation at intermediate layers, to shared-axis alignment in later layers, suggesting a hierarchical emergence from surface features (e.g., 'close association') to abstract truth.
  8. A replicable methodology: activations are extracted at the most-downstream group (b) hidden state without a few-shot prompt, centered by subtracting the dataset mean, and projected via PCA; probes are trained on an 80/20 train/test split, and out-of-distribution test sets are evaluated in full (100% of the held-out data).
  9. An open question raised is why MM probe directions extracted from the likely dataset produce surprisingly effective causal interventions even though those probes classify true/false statements at near-chance accuracy, suggesting the direction may capture a causally relevant feature independent of classification performance.
  10. Calibrated few-shot prompting is a surprisingly weak baseline for classifying statement truth, underperforming linear probes trained on in-distribution data across multiple LLaMA-2 model sizes and test sets.
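
The visualization recipe in takeaway 8 (center the activations, project with PCA, hold out 20% for in-distribution evaluation) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's pipeline: `pca_project` and `split_80_20` are hypothetical helper names, and the activations are synthetic rather than extracted from a model.

```python
import numpy as np

def pca_project(acts, k=2):
    """Center activations and project them onto the top-k principal components."""
    acts = np.asarray(acts, dtype=float)
    centered = acts - acts.mean(axis=0)  # subtract the dataset mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def split_80_20(n, seed=0):
    """Return shuffled (train_idx, test_idx) covering 80% and 20% of n rows."""
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(0.8 * n)
    return idx[:cut], idx[cut:]

# Synthetic stand-in: 16-dimensional "activations" with one truth-bearing axis
# and a constant offset that centering should remove.
rng = np.random.default_rng(1)
n, d = 500, 16
labels = rng.integers(0, 2, size=n)
direction = np.zeros(d)
direction[3] = 1.0
acts = rng.normal(size=(n, d)) + 6.0 * labels[:, None] * direction + 10.0

proj = pca_project(acts)            # (n, 2) coordinates for a scatter plot
train_idx, test_idx = split_80_20(n)
```

Plotting `proj[:, 0]` against `proj[:, 1]`, colored by `labels`, gives the kind of two-dimensional picture the takeaways describe. In this toy data the separation appears along the first component only because the truth-bearing axis was built to have the largest variance.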

Peer brief — for seminar discussion

Working with the LLaMA-2 family (7B, 13B, and 70B), Marks and Tegmark investigate whether LLMs encode a geometrically coherent, causally active representation of factual truth. They assemble 12 datasets spanning curated templates (cities, 1,496 rows; sp_en_trans, 354 rows; larger_than/smaller_than, 1,980 rows each), logical variants (negations, conjunctions, disjunctions), uncurated benchmarks from Azaria & Mitchell and Casper et al., and a novel likely dataset of nonfactual text designed to dissociate truth from text probability. The analytic pipeline combines PCA visualizations of residual stream activations at causally identified hidden states (group (b), localized via patching; e.g., layer 15 over end-of-sentence punctuation in LLaMA-2-13B), cross-dataset probe transfer experiments, and causal interventions that shift activations along probe-identified directions to flip model truth judgments on out-of-distribution inputs. The load-bearing finding is that LLaMA-2-13B and 70B, but not 7B, linearly represent truth in a direction that generalizes across topically and structurally disparate datasets: probes trained on larger_than+smaller_than exceed 95% accuracy on sp_en_trans for both 13B and 70B without any fine-tuning or domain overlap. The paper introduces mass-mean probing (MM), an optimization-free method that computes the difference-in-means vector between true and false activations, optionally correcting for covariance (equivalent to linear discriminant analysis on IID data). MM achieves normalized indirect effects (NIEs) of 0.85/0.97 (false→true / true→false) on sp_en_trans for LLaMA-2-13B with cities+neg_cities training, versus LR NIEs of 0.33/0.52, a large causal advantage despite near-identical classification accuracy. Contrast-consistent search (CCS; Burns et al., 2023) is included as a further comparison; it consistently underperforms MM on causal metrics while matching it on accuracy.
The paper also shows that representations in early layers of LLaMA-2-13B for cities and neg_cities are antipodally aligned before rotating to orthogonality and finally to shared-axis alignment in later layers, consistent with a hypothesis of hierarchical abstraction from surface features to a general truth concept. The central implication is that truth has a real geometric foothold in large transformers, not merely as a classification artifact but as a causally manipulable direction — which has direct relevance to mechanistic interpretability and to schemes for detecting or eliciting honest behavior. A critical reader would push back on the scope restriction: the paper deliberately limits analysis to simple, unambiguous, uncontroversial factual statements and explicitly acknowledges it cannot disambiguate 'true' from 'commonly believed,' 'verifiable,' or 'uncontroversial.' This means the identified direction may be a representation of epistemic certainty or familiarity rather than truth per se, and the entire empirical architecture is designed to sidestep exactly the hard cases — contested facts, deceptive outputs, opinion — where the practical stakes are highest. Whether the linearly-represented direction found on cities and larger_than survives on genuinely contested or multi-step reasoning statements remains an open question, and the restriction to the LLaMA-2 family means generalization to other architectures or training regimes is untested.
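
For seminar purposes, the intervention and its score can be made concrete with a small sketch. The page does not spell out how the NIE is computed, so the normalization below is an assumption: it reports the fraction of the gap between the unmodified source statements and the opposite class that the intervention closes, with each `pd_*` argument standing for an averaged P(TRUE) minus P(FALSE). The function names and all numbers are hypothetical.

```python
import numpy as np

def shift_along_direction(hidden, theta, sign=1.0):
    """Add (sign=+1) or subtract (sign=-1) a probe direction from a hidden state."""
    return np.asarray(hidden, dtype=float) + sign * np.asarray(theta, dtype=float)

def normalized_indirect_effect(pd_intervened, pd_source, pd_target):
    """Fraction of the source-to-target gap closed by the intervention.

    Assumed normalization: 0 means the intervention changed nothing,
    1 means intervened statements are treated like the opposite class.
    """
    gap = pd_target - pd_source
    if gap == 0:
        raise ValueError("source and target behaviour are identical")
    return (pd_intervened - pd_source) / gap

# Hypothetical numbers, not results from the paper.
shifted = shift_along_direction([1.0, 2.0], [0.5, -0.5], sign=-1.0)
nie = normalized_indirect_effect(pd_intervened=0.25, pd_source=-0.5, pd_target=0.5)
```

Under this reading, two probes with the same classification accuracy can still differ sharply in NIE, because accuracy depends only on which side of a hyperplane points fall, while the intervention depends on the direction itself being the one the model uses downstream.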

Methods (1)

  • Contrast-Consistent Search
    Unsupervised probing method from Burns et al. 2023 that identifies directions along which contrast pair representations are far apart
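
For orientation, the CCS objective as described by Burns et al. (2023) combines a consistency term with a confidence term over the probe's outputs on a statement and its negation. The sketch below only evaluates that loss for given probabilities; the function name `ccs_loss` is an assumption, and the details of normalization and optimization in the original method are omitted.

```python
import numpy as np

def ccs_loss(p_pos, p_neg):
    """CCS-style loss for probe outputs on contrast pairs.

    p_pos: probe probabilities for statements x+ (e.g. "... is true")
    p_neg: probe probabilities for their negated counterparts x-
    """
    p_pos = np.asarray(p_pos, dtype=float)
    p_neg = np.asarray(p_neg, dtype=float)
    consistency = (p_pos - (1.0 - p_neg)) ** 2  # p(x+) should equal 1 - p(x-)
    confidence = np.minimum(p_pos, p_neg) ** 2  # discourage the 0.5/0.5 solution
    return float(np.mean(consistency + confidence))

confident_loss = ccs_loss([1.0, 0.0], [0.0, 1.0])  # consistent and confident
hedging_loss = ccs_loss([0.5], [0.5])              # consistent but uninformative
```

No truth labels enter the loss, which is why the method is unsupervised.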

Frameworks (4)

  • Eliciting Latent Knowledge (ELK)
    Christiano et al. (2021) framework motivating the problem of determining whether a model 'believes' a statement; cited as core motivation
  • Linear World Models in LLMs
    Prior work framework studying whether LLMs encode world models as linear structures in their representations
  • Mass-Mean Probing
    Introduced in this paper: an optimization-free probing technique using difference-in-means direction with optional covariance correction
  • Superposition Hypothesis
    Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition

Original abstract

Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we use high-quality datasets of simple true/false statements to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements. We also show that simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs.
