claim
active
claim:training-probes-on-statements-and-their-opposites-improves-generalization-by-mitigating-non-truth-features-with-opposite-sign-correlations
Training probes on statements and their opposites improves generalization by mitigating non-truth features with opposite-sign correlations
Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
Source paper
extracted_from · (2023) · Samuel Marks · Max Tegmark
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (4)
finding
- For LLaMA-2-70B, probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans regardless of probing technique · associated_with · Striking cross-domain generalization result supporting the claim that larger models represent abstract truth
- Demonstrates strong anti-correlation between text probability and truth in negated datasets
- Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
- Training on statements and their negations mitigates non-truth feature interference in probe directions
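The mechanism behind this claim can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the paper's code or data: activations are synthetic, the `TRUTH` and `CONFOUND` axes, the helper names, and all noise scales are hypothetical, and the probe is a simple difference-of-class-means direction chosen for brevity. A non-truth feature correlates positively with truth in one dataset and negatively in its "opposite" dataset, so pooling the two cancels its contribution to the probe direction.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
TRUTH, CONFOUND = 0, 1  # hypothetical axes of a toy activation space


def make_dataset(n, confound_sign, noise=0.3):
    """Synthetic activations: the truth axis tracks the label, and the
    confound axis tracks the label with the given sign (+1, -1, or 0)."""
    labels = np.arange(n) % 2          # balanced: 0 = false, 1 = true
    signs = 2.0 * labels - 1.0         # map {0, 1} -> {-1, +1}
    acts = noise * rng.standard_normal((n, DIM))
    acts[:, TRUTH] += signs
    acts[:, CONFOUND] += confound_sign * signs
    return acts, labels


def mean_difference_probe(acts, labels):
    """Probe direction = mean(true) - mean(false); also return the midpoint
    between the class means, used as the decision threshold."""
    mu_true = acts[labels == 1].mean(axis=0)
    mu_false = acts[labels == 0].mean(axis=0)
    return mu_true - mu_false, 0.5 * (mu_true + mu_false)


def accuracy(theta, midpoint, acts, labels):
    preds = ((acts - midpoint) @ theta > 0).astype(int)
    return float((preds == labels).mean())


# Stand-ins for a dataset and its opposite (e.g. statements and negations):
# the confound has opposite-sign correlation with truth in the two sets.
acts_a, labels_a = make_dataset(500, confound_sign=+1)
acts_b, labels_b = make_dataset(500, confound_sign=-1)

# Out-of-distribution test set: the confound varies independently of truth.
acts_test, labels_test = make_dataset(2000, confound_sign=0)
acts_test[:, CONFOUND] += 2.0 * rng.standard_normal(2000)

theta_single, mid_single = mean_difference_probe(acts_a, labels_a)
theta_both, mid_both = mean_difference_probe(
    np.vstack([acts_a, acts_b]), np.concatenate([labels_a, labels_b])
)

acc_single = accuracy(theta_single, mid_single, acts_test, labels_test)
acc_both = accuracy(theta_both, mid_both, acts_test, labels_test)

print("confound weight, single set:", round(float(theta_single[CONFOUND]), 3))
print("confound weight, both sets: ", round(float(theta_both[CONFOUND]), 3))
print("OOD accuracy, single set:", acc_single)
print("OOD accuracy, both sets: ", acc_both)
```

In this toy setting the probe trained on one dataset puts substantial weight on the confound axis, while the probe trained on the pooled pair puts almost none, which is why the latter transfers better to a test set where the confound is uninformative. Real activations need not decompose this cleanly, so the sketch shows the cancellation argument only.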
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Shows that truth representations are not reducible to text probability representations
- Key improvement in cross-task generalization enabled by explicit instruction framing.
- Shows the passive vs. active divide is more important than the specific wording of instructions.
- Shows the key divide is passive vs. active framing, not the specific wording of instructions.
- Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
- Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
- Selective pressure toward convergence via task generality
- Out-of-domain generalization showing deception features track general representational honesty