finding

active

finding:probes-trained-on-the-likely-dataset-perform-worse-than-chance-on-datasets-with-anti-correlations-between-text-probability-and-truth

Probes trained on the likely dataset perform worse than chance on datasets with anti-correlations between text probability and truth

Shows that truth representations are not reducible to text probability representations
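A toy sketch of the reasoning behind this finding, not the paper's data or code. It assumes 2-D synthetic "activations" with one text-probability feature and one truth feature, and a simplified difference-of-means probe thresholded at the class midpoint. All names, cluster positions, and the noise level are hypothetical. A probe fit on likely/unlikely labels learns the probability direction, so it should score below chance where true statements are the less probable ones, while a probe fit on truth labels should not.

```python
import random

rng = random.Random(0)   # fixed seed so the toy run is repeatable
NOISE = 0.3              # assumed noise level, chosen only for illustration

def make_set(n, means_for_label):
    """Build n toy activations [probability feature, truth feature] with alternating labels."""
    xs, ys = [], []
    for i in range(n):
        y = i % 2
        prob_mean, truth_mean = means_for_label[y]
        xs.append([rng.gauss(prob_mean, NOISE), rng.gauss(truth_mean, NOISE)])
        ys.append(y)
    return xs, ys

def mean_vec(rows):
    return [sum(r[i] for r in rows) / len(rows) for i in range(len(rows[0]))]

def fit_probe(xs, ys):
    """Difference-of-means direction plus the midpoint between the two class means."""
    pos = mean_vec([x for x, y in zip(xs, ys) if y == 1])
    neg = mean_vec([x for x, y in zip(xs, ys) if y == 0])
    direction = [p - q for p, q in zip(pos, neg)]
    midpoint = [(p + q) / 2 for p, q in zip(pos, neg)]
    return direction, midpoint

def accuracy(probe, xs, ys):
    direction, midpoint = probe
    hits = 0
    for x, y in zip(xs, ys):
        score = sum(d * (xi - m) for d, xi, m in zip(direction, x, midpoint))
        hits += int((1 if score > 0 else 0) == y)
    return hits / len(ys)

# "likely"-style training set: label 1 = probable text; no truth signal at all.
likely_train = make_set(200, {1: (1.0, 0.0), 0: (-1.0, 0.0)})
# Truth training set: label 1 = true; probability assumed uncorrelated with the label.
truth_train = make_set(200, {1: (0.0, 1.0), 0: (0.0, -1.0)})
# Anti-correlated evaluation set: true statements are the LESS probable ones.
anti_eval = make_set(200, {1: (-1.0, 1.0), 0: (1.0, -1.0)})

likely_probe = fit_probe(*likely_train)
truth_probe = fit_probe(*truth_train)

acc_likely_on_train = accuracy(likely_probe, *likely_train)
acc_likely_on_anti = accuracy(likely_probe, *anti_eval)
acc_truth_on_anti = accuracy(truth_probe, *anti_eval)

print("likely probe, own training set:   ", acc_likely_on_train)
print("likely probe, anti-correlated set:", acc_likely_on_anti)
print("truth probe, anti-correlated set: ", acc_truth_on_anti)
```

In this toy setup the two features are separable by construction, so it shows only the logic of the argument. It is not evidence about any real model.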

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Claims (1)

claim

LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on the likely dataset performing poorly on anti-correlated datasets
associated_with · supports
Establishes that the observed linear structure is not merely a representation of text probability

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Training probes on statements and their opposites improves generalization by mitigating non-truth features with opposite-sign correlations · claim · 0.839
Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · finding · 0.814
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
Probe-based data attribution effectively reduces harmful behaviors via data interventions · claim · 0.791
Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
Why were interventions with mass-mean probe directions extracted from the likely dataset so effective, despite these probes not being accurate at classifying true/false statements? · question · 0.790
Open question raised in §7.1 about an unexplained anomalous result
Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases · finding · 0.789
Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
A probe may achieve high performance even on representations that are not causally relevant for the task · claim · 0.788
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
Truth probes fail to generalize to harder factual tasks F3-F5 regardless of prompt template, with AUROC near or below 0.6. · finding · 0.782
Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
Why did mass-mean probing with cities+neg_cities training data perform poorly for the 70B model, despite larger_than+smaller_than performing well? · question · 0.777
Open question about scale-dependent asymmetry in training data effects