finding

active

finding:mm-probes-trained-on-larger-than-smaller-than-achieve-lower-nie-than-those-trained-on-cities-neg-cities-despite-higher-classification-accuracy-on-sp-en-trans

MM probes trained on larger_than+smaller_than achieve lower NIE than those trained on cities+neg_cities despite higher classification accuracy on sp_en_trans

Classification accuracy and causal efficacy (NIE) can dissociate: training on statements and their opposites does not always yield a more causally implicated direction
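The two quantities contrasted in this finding can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes activations are already centered, that a mass-mean (MM) probe direction is the difference between the mean true and mean false activations, and that the false→true NIE is the intervened probability difference rescaled between the false and true baselines. The function names and the parameters `pd_intervened`, `pd_false`, and `pd_true` are hypothetical, and the NIE formula should be checked against the source paper.

```python
import numpy as np

def mass_mean_direction(acts, labels):
    """Difference of class means: mean(true acts) - mean(false acts)."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    return acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)

def probe_accuracy(direction, acts, labels):
    """Classify by the sign of the projection onto `direction`.

    Assumes centered activations, so zero is a reasonable threshold.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    preds = acts @ direction > 0
    return float((preds == labels).mean())

def normalized_indirect_effect(pd_intervened, pd_false, pd_true):
    """Assumed false->true NIE: 0 means no effect, 1 means the intervention
    moves false statements all the way to the true-statement baseline.

    Each argument is an average of P(TRUE) - P(FALSE) over statements.
    """
    return (pd_intervened - pd_false) / (pd_true - pd_false)
```

Accuracy only asks whether the direction separates held-out activations, while NIE asks whether shifting activations along that direction changes the model's output. A direction can therefore score well on the first and poorly on the second.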

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Claims (1)

claim

Training probes on statements and their opposites improves generalization by mitigating non-truth features with opposite-sign correlations
contradicts
Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
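The mechanism named in this claim can be illustrated with a toy example. This is a sketch under strong assumptions, not data from the paper: activations are two-dimensional, dimension 0 is a hypothetical truth feature, and dimension 1 is a hypothetical non-truth feature that correlates with truth in one dataset and anti-correlates with it in the negated dataset. The arrays are made up for illustration.

```python
import numpy as np

def mass_mean_direction(acts, labels):
    """Difference of class means: mean(true acts) - mean(false acts)."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    return acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)

labels = np.array([True, True, False, False])

# Toy "cities": the non-truth feature (dim 1) has the same sign as truth (dim 0).
cities = np.array([[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]])

# Toy "neg_cities": the non-truth feature has the opposite sign to truth.
neg_cities = np.array([[1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, 1.0]])

# Trained on one dataset, the direction mixes truth with the non-truth feature.
theta_single = mass_mean_direction(cities, labels)

# Trained on both, the opposite-sign correlations cancel in dim 1.
theta_both = mass_mean_direction(
    np.vstack([cities, neg_cities]),
    np.concatenate([labels, labels]),
)
```

In this toy setup `theta_single` has equal weight on both dimensions, while `theta_both` has weight only on dimension 0. Whether the resulting direction is also more causally effective is a separate question, which is the point on which the finding above contradicts the claim.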

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Why did mass-mean probing with cities+neg_cities training data perform poorly for the 70B model, despite larger_than+smaller_than performing well? · question · 0.841
Open question about scale-dependent asymmetry in training data effects
MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · finding · 0.840
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
Why did mass-mean probing with cities+neg_cities perform poorly for the 70B model, despite mass-mean probing with larger_than+smaller_than performing well? · question · 0.828
Unexplained result pointing to asymmetry in how training on opposites affects truth probes at 70B scale
For LLaMA-2-70B, probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans regardless of probing technique · finding · 0.808
Striking cross-domain generalization result supporting the claim that larger models represent abstract truth
Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases · finding · 0.780
Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
Probes trained under different explicit instruction variants are highly aligned with each other despite different wording · claim · 0.779
Shows the key divide is passive vs. active framing, not the specific wording of instructions.
Mass-mean probe directions outperform LR and CCS in causal intervention experiments (NIE) in 7/8 experimental conditions · finding · 0.777
Core result showing that MM directions are more causally implicated than LR directions despite similar classification accuracy
Probes trained on the likely dataset perform worse than chance on datasets with anti-correlations between text probability and truth · finding · 0.776
Shows that truth representations are not reducible to text probability representations