claim

active

claim:llms-linearly-represent-truth-relevant-information-beyond-the-plausibility-of-text-as-evidenced-by-probes-trained-on-likely-performing-poorly-on-anti-correlated-datasets

LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on the likely dataset performing poorly on datasets where text probability and truth are anti-correlated

Establishes that the observed linear structure is not merely a representation of text probability

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Papers (1)

paper

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
contradicts · introduces

Findings (5)

finding

For LLaMA-2-70B, probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans regardless of probing technique
associated_with · supports
Striking cross-domain generalization result supporting the claim that larger models represent abstract truth
A mass-mean (MM) probe trained on the likely dataset achieves a normalized indirect effect (NIE) of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes
associated_with · supports
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
PCA visualizations of LLaMA-2-13B and 70B representations of curated datasets show clear linear structure, with true statements separating from false ones in the top two principal components
associated_with · supports
Primary visual evidence for linear truth representations in large LLMs
Probes trained on the likely dataset perform worse than chance on datasets with anti-correlations between text probability and truth
associated_with · supports
Shows that truth representations are not reducible to text probability representations
For neg_cities, truth value and LLaMA-2-70B log probability correlate at r=-0.63; for neg_sp_en_trans at r=-0.89
supports
Demonstrates strong anti-correlation between text probability and truth in negated datasets
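
The findings above can be illustrated with a toy sketch. It is not the paper's experiment: the 2-D "activations", the `make_set` generator, and the noise level are all hypothetical, and the probe is a plain difference-of-means classifier standing in for the MM probe. The sketch only shows why a probe that has learned a plausibility direction scores below chance once plausibility and truth are anti-correlated, while a probe trained on data that breaks that correlation does not.

```python
import random

def mean_vec(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def fit_mass_mean(xs, ys):
    """Difference-of-means probe: direction = mean(label 1) - mean(label 0)."""
    pos = mean_vec([x for x, y in zip(xs, ys) if y == 1])
    neg = mean_vec([x for x, y in zip(xs, ys) if y == 0])
    direction = [p - q for p, q in zip(pos, neg)]
    # Decision threshold sits halfway between the two class means.
    bias = -sum(d * (p + q) / 2 for d, p, q in zip(direction, pos, neg))
    return direction, bias

def accuracy(probe, xs, ys):
    """Fraction of examples whose sign(score) matches the label."""
    direction, bias = probe
    hits = 0
    for x, y in zip(xs, ys):
        score = sum(d * v for d, v in zip(direction, x)) + bias
        hits += int((score > 0) == (y == 1))
    return hits / len(xs)

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def make_set(rng, n, kind):
    """Hypothetical 2-D activations: [plausibility coordinate, truth coordinate].

    kind="likely":  label = likely vs unlikely text, no truth signal.
    kind="aligned": label = truth, plausibility moves with truth.
    kind="negated": label = truth, plausibility moves against truth.
    """
    xs, ys = [], []
    for i in range(n):
        y = i % 2
        s = 1.0 if y == 1 else -1.0
        plaus = {"likely": s, "aligned": s, "negated": -s}[kind]
        truth = 0.0 if kind == "likely" else s
        xs.append([plaus + rng.gauss(0, 0.5), truth + rng.gauss(0, 0.5)])
        ys.append(y)
    return xs, ys

rng = random.Random(0)
likely_x, likely_y = make_set(rng, 400, "likely")
aligned_x, aligned_y = make_set(rng, 400, "aligned")
negtrain_x, negtrain_y = make_set(rng, 400, "negated")
test_x, test_y = make_set(rng, 400, "negated")  # held-out anti-correlated set

likely_probe = fit_mass_mean(likely_x, likely_y)
truth_probe = fit_mass_mean(aligned_x + negtrain_x, aligned_y + negtrain_y)

r_negated = pearson([x[0] for x in test_x], test_y)
acc_likely_probe = accuracy(likely_probe, test_x, test_y)
acc_truth_probe = accuracy(truth_probe, test_x, test_y)
print(f"r(plausibility, truth) on negated toy set: {r_negated:.2f}")
print(f"likely-trained probe accuracy: {acc_likely_probe:.2f}")
print(f"truth-trained probe accuracy:  {acc_truth_probe:.2f}")
```

In this toy setting the likely-trained probe inherits the sign of the plausibility coordinate, so an anti-correlated test set flips most of its predictions. The real experiments use LLaMA activations rather than synthetic vectors, so the numbers printed here carry no evidential weight.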

Questions (4)

question

Do LLMs have a unified representation of truth that spans structurally and topically diverse data?
gates
Central research question driving dataset design and experimental approach
Can truth representations be disambiguated from closely related features such as 'commonly believed' or 'verifiable' using simple factual statements?
gates
Acknowledged limitation: simple uncontroversial statements cannot distinguish truth from related epistemic features
Can we disambiguate truth from closely related features such as 'commonly believed' or 'verifiable'?
gates
Limitation noted in §7.1: scope restricted to simple statements prevents disambiguation
Given a language model M and a statement s, does M believe s to be true?
gates
The core motivating question of the paper, framed by Christiano et al. (2021)

Claims (2)

claim

As LLMs scale, they develop increasingly general abstractions, with large models linearly representing abstract concepts like truth that capture shared properties of diverse inputs
extends
Interpretive claim connecting scale to abstraction level in LLM representations
LLMs sometimes know statements are false but generate them anyway, motivating the need for techniques that inspect internal model state rather than outputs alone
supports
Motivating claim supported by the CAPTCHA example and Perez et al. (2022) findings

Datasets (1)

dataset

likely dataset
supports
Nonfactual text whose final token is either the most likely or the 100th most likely completion according to LLaMA-13B; used to distinguish truth from text probability
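
A minimal sketch of how such rows could be assembled, assuming a next-token log-probability table is already available. The helpers `token_at_rank` and `make_likely_pair`, the toy vocabulary, and the 1/0 labelling are hypothetical; this page does not say whether the original dataset pairs both completions with the same prefix.

```python
def token_at_rank(logprobs, rank):
    """Return the token with the given 1-based rank under `logprobs`."""
    ranked = sorted(logprobs, key=logprobs.get, reverse=True)
    return ranked[rank - 1]

def make_likely_pair(prefix, logprobs):
    """Build (text, label) rows: label 1 = most likely, 0 = 100th most likely."""
    return [
        (prefix + token_at_rank(logprobs, 1), 1),
        (prefix + token_at_rank(logprobs, 100), 0),
    ]

# Hypothetical distribution: 150 made-up tokens with strictly decreasing log-probability.
toy_logprobs = {f" tok{i}": -0.1 * i for i in range(150)}
rows = make_likely_pair("Some nonfactual prefix", toy_logprobs)
print(rows)
```

In practice the table would come from a language model's output distribution at the final position; here it is fabricated purely to make the ranking step concrete.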

Methods (1)

method

Causal Intervention via Activation Shifting
supports
Method of shifting hidden state activations along probe directions to cause the model to treat false statements as true and vice versa; evaluated on OOD inputs
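
The shift itself can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the 3-D vectors, the `toy_readout` function standing in for the model's P(TRUE) minus P(FALSE), and the `nie` helper are all hypothetical. In particular, `nie` assumes the normalized indirect effect is the fraction of the true/false readout gap that the shift closes; this page does not define it.

```python
import math

def shift(hidden, direction, alpha=1.0):
    """Move a hidden state along a probe direction (alpha < 0 shifts the other way)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def toy_readout(hidden, weights):
    """Hypothetical stand-in for P(TRUE) - P(FALSE), squashed into (-1, 1)."""
    return math.tanh(sum(h * w for h, w in zip(hidden, weights)))

def nie(pd_shifted, pd_false, pd_true):
    """Assumed normalization: share of the true/false gap closed by the shift."""
    return (pd_shifted - pd_false) / (pd_true - pd_false)

# Made-up class means; the probe direction is their difference (mass-mean style).
mu_true = [1.0, 0.5, 0.0]
mu_false = [-1.0, 0.5, 0.0]
theta = [t - f for t, f in zip(mu_true, mu_false)]

readout_w = [1.0, 0.0, 0.0]     # hypothetical readout weights
false_state = [-0.8, 0.4, 0.1]  # activation for a false statement
true_state = [0.9, 0.6, -0.1]   # activation for a true statement

shifted = shift(false_state, theta)  # false -> true intervention
pd_false = toy_readout(false_state, readout_w)
pd_true = toy_readout(true_state, readout_w)
pd_shifted = toy_readout(shifted, readout_w)
effect = nie(pd_shifted, pd_false, pd_true)
print(f"readout before: {pd_false:.2f}, after: {pd_shifted:.2f}, NIE: {effect:.2f}")
```

In a real model the shift would be applied to hidden states at chosen layers and token positions during the forward pass, and the readout would be the model's own output probabilities; a true→false intervention would use a negative `alpha`.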

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Linear truth directions in LLMs are reliable primarily in factual recall cases and break down when truth assessment depends on computing and storing intermediate results. · claim · 0.855
Central empirical conclusion of the paper about the fundamental limits of truth directions.
In intermediate regimes of scale or layer depth, LLMs may linearly represent features at intermediate levels of abstraction such as 'accurate factual recall' or 'close association' rather than abstract truth · claim · 0.845
Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations
We hypothesize that LLMs represent correctness of arithmetic expressions differently from factual statements. · hypothesis · 0.834
Core working hypothesis motivating the factual vs. arithmetic task split in the experimental design.
LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM · finding · 0.833
Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure
Truthful behavior in LLMs is not confined to a single linear axis; multiple orthogonal directions can independently mediate it · claim · 0.829
Central interpretive claim of the paper
Connecting the Dots: LLMs Can Infer and Verbalize Latent Structure from Disparate Training Data (Treutlein et al. 2024) · concept · 0.829
Out-of-context reasoning work directly related to synthetic document fine-tuning experiments
Automated interpretability using LLMs can usefully score feature specificity. · claim · 0.820
Claude 3 Opus ratings aligned with human judgment of feature descriptions.
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: a linear relationship between language modeling score and vision-language alignment · finding · 0.820
Core cross-modal empirical result: larger and better language models align better with vision models