claim

active

claim:training-probes-on-statements-and-their-opposites-improves-generalization-by-mitigating-non-truth-features-with-opposite-sign-correlations

Training probes on statements and their opposites improves generalization by mitigating non-truth features with opposite-sign correlations

Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
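The mechanism in this claim can be illustrated with a toy simulation. The sketch below is not the paper's code or data. It assumes synthetic "activations" in which axis 0 encodes truth and axis 1 encodes a non-truth feature (for example, text probability) whose correlation with truth flips sign between statements and their opposites. The function names (make_dataset, fit_mass_mean, accuracy) and all numeric settings are hypothetical choices made for illustration. A difference-in-means probe fit on statements alone absorbs the non-truth axis, while a probe fit on statements plus opposites largely cancels it.

```python
import numpy as np

def make_dataset(n, confound_sign, rng, dim=16, noise=0.3):
    """Toy activations: axis 0 encodes truth, axis 1 a non-truth feature.

    confound_sign=+1 mimics plain statements (feature agrees with truth);
    confound_sign=-1 mimics their opposites (feature anti-correlated with truth).
    """
    labels = np.arange(n) % 2                   # balanced labels: 1 = true, 0 = false
    signed = 2 * labels - 1                     # map {0, 1} -> {-1, +1}
    x = rng.normal(scale=noise, size=(n, dim))  # isotropic noise in every dimension
    x[:, 0] += signed                           # truth feature
    x[:, 1] += 2.0 * confound_sign * signed     # stronger non-truth feature
    return x, labels

def fit_mass_mean(x, labels):
    """Difference-in-means probe: direction between class means, threshold at the midpoint."""
    mu_true = x[labels == 1].mean(axis=0)
    mu_false = x[labels == 0].mean(axis=0)
    direction = mu_true - mu_false
    midpoint = (mu_true + mu_false) / 2
    bias = -(direction @ midpoint)
    return direction, bias

def accuracy(probe, x, labels):
    direction, bias = probe
    preds = (x @ direction + bias > 0).astype(int)
    return float((preds == labels).mean())

rng = np.random.default_rng(0)
statements = make_dataset(500, +1, rng)   # stand-in for a dataset such as cities
opposites = make_dataset(500, -1, rng)    # stand-in for a dataset such as neg_cities
ood = make_dataset(500, -1, rng)          # held-out set where the feature is anti-correlated

probe_single = fit_mass_mean(*statements)
x_pair = np.concatenate([statements[0], opposites[0]])
y_pair = np.concatenate([statements[1], opposites[1]])
probe_paired = fit_mass_mean(x_pair, y_pair)

acc_single = accuracy(probe_single, *ood)
acc_paired = accuracy(probe_paired, *ood)
print(f"trained on statements only:        {acc_single:.3f}")
print(f"trained on statements + opposites: {acc_paired:.3f}")
```

In this toy setting the single-dataset probe falls below chance on the held-out set because its direction is dominated by the non-truth axis, whereas the paired probe's direction lies almost entirely along the truth axis. Real model activations are not this clean, so the example shows only the cancellation argument, not the magnitude of the effect reported in the paper.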

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Papers (1)

paper

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
introduces

Findings (4)

finding

For LLaMA-2-70B, probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans regardless of probing technique
associated_with
Striking cross-domain generalization result supporting the claim that larger models represent abstract truth
For neg_cities, truth value and LLaMA-2-70B log probability correlate at r=-0.63; for neg_sp_en_trans at r=-0.89
supports
Demonstrates strong anti-correlation between text probability and truth in negated datasets
MM probes trained on larger_than+smaller_than achieve lower NIE than those trained on cities+neg_cities despite higher classification accuracy on sp_en_trans
contradicts
Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
Training on cities+neg_cities improves OOD generalization, especially on neg_sp_en_trans
supports
Training on statements and their negations mitigates non-truth feature interference in probe directions

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Probes trained on the likely dataset perform worse than chance on datasets with anti-correlations between text probability and truth · finding · 0.839
Shows that truth representations are not reducible to text probability representations
Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent · finding · 0.802
Key improvement in cross-task generalization enabled by explicit instruction framing.
Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity · finding · 0.798
Shows the passive vs. active divide is more important than the specific wording of instructions.
Probes trained under different explicit instruction variants are highly aligned with each other despite different wording · claim · 0.793
Shows the key divide is passive vs. active framing, not the specific wording of instructions.
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs · claim · 0.782
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
Truth probes fail to generalize to harder factual tasks F3-F5 regardless of prompt template, with AUROC near or below 0.6 · finding · 0.776
Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
There are fewer representations competent for N tasks than M<N tasks, so training more general models should yield fewer possible solutions · hypothesis · 0.773
Selective pressure toward convergence via task generality
Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories · finding · 0.772
Out-of-domain generalization showing deception features track general representational honesty