claim
active
claim:llms-linearly-represent-truth-relevant-information-beyond-the-plausibility-of-text-as-evidenced-by-probes-trained-on-likely-performing-poorly-on-anti-correlated-datasets
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on the likely dataset performing poorly on anti-correlated datasets
Establishes that the observed linear structure is not merely a representation of text probability
Source paper
extracted_from · (2023) · Samuel Marks · Max Tegmark
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (5)
finding
- For LLaMA-2-70B, probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans regardless of probing technique · associated_with · supports · Striking cross-domain generalization result supporting the claim that larger models represent abstract truth
- Mass-mean (MM) probe trained on the likely dataset achieves an NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · associated_with · supports · Likely-trained MM probe is a surprisingly effective causal baseline due to the correlation between truth and probability on sp_en_trans
- PCA visualizations of LLaMA-2-13B and 70B representations of curated datasets show clear linear structure, with true statements separating from false ones in the top two principal components · associated_with · supports · Primary visual evidence for linear truth representations in large LLMs
- Probes trained on the likely dataset perform worse than chance on datasets with anti-correlations between text probability and truth · associated_with · supports · Shows that truth representations are not reducible to text probability representations
- Demonstrates strong anti-correlation between text probability and truth in negated datasets
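The finding that a likely-trained probe can work on one dataset yet fall below chance on another can be illustrated with a toy sketch. The code below is not the paper's setup: it assumes synthetic 2-D "activations" in which axis 0 stands in for text probability and axis 1 for truth, and it uses a simple mass-mean probe (difference of class means). All function names, dimensions, and noise levels are hypothetical choices made for illustration.

```python
import random

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def fit_mass_mean(xs, labels):
    """Return (direction, midpoint): difference and average of the two class means."""
    pos = mean([x for x, y in zip(xs, labels) if y == 1])
    neg = mean([x for x, y in zip(xs, labels) if y == 0])
    direction = [p - q for p, q in zip(pos, neg)]
    midpoint = [(p + q) / 2 for p, q in zip(pos, neg)]
    return direction, midpoint

def accuracy(probe, xs, labels):
    """Fraction of points whose side of the midpoint along the direction matches the label."""
    direction, midpoint = probe
    correct = 0
    for x, y in zip(xs, labels):
        score = sum(d * (xi - m) for d, xi, m in zip(direction, x, midpoint))
        correct += int((score > 0) == (y == 1))
    return correct / len(xs)

def make_dataset(rng, n, prob_sign, truth_sign, noise=0.3):
    """Toy data (assumption): label-1 points centre on (prob_sign, truth_sign),
    label-0 points on the negation, with Gaussian noise on both axes."""
    xs, labels = [], []
    for i in range(n):
        y = i % 2
        s = 1 if y == 1 else -1
        xs.append([s * prob_sign + rng.gauss(0, noise),
                   s * truth_sign + rng.gauss(0, noise)])
        labels.append(y)
    return xs, labels

rng = random.Random(0)
# "likely"-style data: the label tracks probability only; the truth axis is pure noise.
likely = make_dataset(rng, 200, prob_sign=1, truth_sign=0)
# Truth-labelled data where true statements are also probable.
correlated = make_dataset(rng, 200, prob_sign=1, truth_sign=1)
# Truth-labelled data where true statements are improbable (e.g. negated statements).
anti = make_dataset(rng, 200, prob_sign=-1, truth_sign=1)

probe = fit_mass_mean(*likely)
acc_correlated = accuracy(probe, *correlated)
acc_anti = accuracy(probe, *anti)
print(acc_correlated, acc_anti)
```

In this toy setting the probe learns the probability axis, so it classifies well when truth and probability agree and systematically inverts its answers when they are anti-correlated. A probe that tracked truth itself would not show that inversion, which is the logic behind the claim.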
Questions (4)
question
- Do LLMs have a unified representation of truth that spans structurally and topically diverse data? · gates · Central research question driving dataset design and experimental approach
- Acknowledged limitation: simple uncontroversial statements cannot distinguish truth from related epistemic features
- Can we disambiguate truth from closely related features such as 'commonly believed' or 'verifiable'? · gates · Limitation noted in §7.1: scope restricted to simple statements prevents disambiguation
- The core motivating question of the paper, framed by Christiano et al. (2021)
Claims (2)
claim
- Interpretive claim connecting scale to abstraction level in LLM representations
- Motivating claim supported by the CAPTCHA example and Perez et al. (2022) findings
Datasets (1)
dataset
- likely dataset · supports · Nonfactual text where the final token is either the most likely or the 100th most likely according to LLaMA-13B; used to distinguish truth from text probability
Methods (1)
method
- Method of shifting hidden state activations along probe directions to cause the model to treat false statements as true and vice versa; evaluated on OOD inputs
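The intervention described above amounts to adding a scaled probe direction to a hidden state. A minimal sketch follows, assuming a hidden state is a plain list of floats and that a linear readout decides "true" when its score is positive. The helper names, the example vectors, and the `alpha` scale are hypothetical; how and where the shift is applied inside a real model is not specified here.

```python
def shift_along_direction(hidden, direction, alpha=1.0):
    """Return hidden + alpha * direction (alpha > 0 pushes toward 'true')."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def probe_score(hidden, direction):
    """Linear readout: positive means the probe reads the state as 'true'."""
    return sum(h * d for h, d in zip(hidden, direction))

# Hypothetical probe direction and a hidden state the probe reads as 'false'.
direction = [2.0, 0.0, -1.0]
false_state = [-1.0, 0.5, 0.5]

shifted = shift_along_direction(false_state, direction, alpha=1.0)
score_before = probe_score(false_state, direction)
score_after = probe_score(shifted, direction)
print(score_before, score_after)
```

A negative `alpha` performs the reverse shift, moving a state the probe reads as true toward the false side.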
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Central empirical conclusion of the paper about the fundamental limits of truth directions.
- Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations
- We hypothesize that LLMs represent correctness of arithmetic expressions differently from factual statements. · hypothesis · 0.834 · Core working hypothesis motivating the factual vs. arithmetic task split in the experimental design.
- Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure
- Central interpretive claim of the paper
- Out-of-context reasoning work directly related to synthetic document fine-tuning experiments
- Claude 3 Opus ratings aligned with human judgment of feature descriptions.
- Core cross-modal empirical result: larger and better language models align better with vision models