finding
active
For LLaMA-2-70B, probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans regardless of probing technique
Striking cross-domain generalization result supporting the claim that larger models represent abstract truth
Source paper
extracted_from · (2023) · Samuel Marks · Max Tegmark
Neighborhood — ranked by edge-count
Claims (3)
claim
- LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on the likely dataset performing poorly on anti-correlated datasets · associated_with · supports · Establishes that the observed linear structure is not merely a representation of text probability
- Interpretive claim connecting scale to abstraction level in LLM representations
- Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
Questions (1)
question
- Do LLMs have a unified representation of truth that spans structurally and topically diverse data? · answered_by · Central research question driving dataset design and experimental approach
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Larger models linearly represent more general concepts including truth
- Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
- Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
- Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
- Shows behavioral pattern of self-correction is trainable in smaller models
- Despite being simpler and optimization-free, MM probes match accuracy of other techniques at scale
- Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
- Model-specific difference in persona susceptibility
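Illustrative sketch (not the source paper's code): the finding above concerns fitting a probe on one dataset and scoring it on another. The snippet below shows that train-on-A, test-on-B pattern with a simplified mass-mean (MM) probe, whose direction is the difference between the mean "true" and mean "false" activation. Everything here is an assumption made for illustration: the activations are synthetic Gaussians rather than LLaMA-2-70B hidden states, the two datasets are constructed to share one truth direction, each dataset is centered by its own mean before probing, and the function names are hypothetical.

```python
import numpy as np


def make_synthetic_dataset(rng, n, dim, truth_dir, offset, noise=1.0):
    """Hypothetical stand-in for the activations of n labeled statements."""
    labels = rng.integers(0, 2, size=n)      # 1 = true, 0 = false
    signs = 2.0 * labels - 1.0               # map labels to +1 / -1
    acts = offset + signs[:, None] * truth_dir
    acts = acts + noise * rng.standard_normal((n, dim))
    return acts, labels


def fit_mass_mean_direction(acts, labels):
    """Difference of class means; no iterative optimization involved."""
    centered = acts - acts.mean(axis=0)      # assumption: per-dataset centering
    return centered[labels == 1].mean(axis=0) - centered[labels == 0].mean(axis=0)


def probe_accuracy(direction, acts, labels):
    """Classify by the sign of the projection onto the probe direction."""
    centered = acts - acts.mean(axis=0)
    preds = (centered @ direction > 0).astype(int)
    return float((preds == labels).mean())


rng = np.random.default_rng(0)
dim = 64

# Assumption: both synthetic datasets share the same truth direction
# but sit at different dataset-specific offsets.
truth_dir = np.zeros(dim)
truth_dir[0] = 3.0

train_acts, train_labels = make_synthetic_dataset(
    rng, 2000, dim, truth_dir, offset=rng.standard_normal(dim)
)
test_acts, test_labels = make_synthetic_dataset(
    rng, 2000, dim, truth_dir, offset=rng.standard_normal(dim)
)

direction = fit_mass_mean_direction(train_acts, train_labels)
accuracy = probe_accuracy(direction, test_acts, test_labels)
print(f"cross-dataset accuracy: {accuracy:.3f}")
```

High accuracy in this toy setup follows directly from the shared direction built into the synthetic data. It says nothing about real models; it only makes concrete what "trained on one dataset, evaluated on another" means operationally.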