finding

active

finding:for-llama-2-70b-probes-trained-on-larger-than-smaller-than-achieve-95-accuracy-on-sp-en-trans-regardless-of-probing-technique

For LLaMA-2-70B, probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans regardless of probing technique

Striking cross-domain generalization result supporting the claim that larger models represent abstract truth

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Claims (3)

claim

LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by the poor performance of probes trained on the likely dataset when evaluated on datasets where truth and text probability are anti-correlated
associated_with · supports
Establishes that the observed linear structure is not merely a representation of text probability
As LLMs scale, they develop increasingly general abstractions, with large models linearly representing abstract concepts like truth that capture shared properties of diverse inputs
supports
Interpretive claim connecting scale to abstraction level in LLM representations
Training probes on statements and their opposites improves generalization by mitigating non-truth features with opposite-sign correlations
associated_with
Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
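
The cancellation argument in the last claim can be made concrete with a toy example. The sketch below is an illustration under stated assumptions, not the paper's code or data: it uses synthetic 2-D "activations" in which axis 0 stands in for a truth feature and axis 1 for a hypothetical non-truth feature whose correlation with truth flips sign between a statements set and an opposites set. The probe is a simplified mass-mean direction (mean true activation minus mean false activation) with a zero threshold, and all function and variable names are invented for this example.

```python
# Toy illustration with synthetic 2-D "activations" (not real model data).
# Axis 0 stands in for a truth feature; axis 1 is a hypothetical non-truth
# feature whose correlation with truth has opposite sign in the two datasets.

def mean(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def mass_mean_direction(acts, labels):
    """Mean of true activations minus mean of false activations."""
    mu_true = mean([a for a, y in zip(acts, labels) if y == 1])
    mu_false = mean([a for a, y in zip(acts, labels) if y == 0])
    return [t - f for t, f in zip(mu_true, mu_false)]

def make_dataset(spurious_sign):
    """True points lie near +1 on axis 0, false points near -1.
    Axis 1 tracks truth with the given sign (+1 or -1)."""
    acts, labels = [], []
    for label, truth in ((0, -1.0), (1, 1.0)):
        for jitter in (-0.5, 0.5):
            acts.append([truth + jitter, spurious_sign * truth * 2.0])
            labels.append(label)
    return acts, labels

def accuracy(direction, acts, labels):
    """Classify as true when the projection onto `direction` is positive.
    A zero threshold suffices here because the toy data is centered."""
    preds = [int(sum(d * x for d, x in zip(direction, a)) > 0) for a in acts]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

statements = make_dataset(+1)   # hypothetical stand-in for a statements set
opposites = make_dataset(-1)    # hypothetical stand-in for its opposites

# Probe direction from statements alone vs. statements plus opposites.
dir_single = mass_mean_direction(*statements)
both_acts = statements[0] + opposites[0]
both_labels = statements[1] + opposites[1]
dir_both = mass_mean_direction(both_acts, both_labels)

# Evaluate both directions on the opposites set.
acc_single = accuracy(dir_single, *opposites)
acc_both = accuracy(dir_both, *opposites)
print(dir_single, dir_both, acc_single, acc_both)
```

In this toy setup the statements-only direction is [2.0, 4.0], dominated by the non-truth axis, and it misclassifies every point in the opposites set. Training on both sets cancels the non-truth component, giving [2.0, 0.0] and perfect accuracy on the same points. Real activations are high-dimensional and noisy, so the cancellation there would be approximate at best.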

Questions (1)

question

Do LLMs have a unified representation of truth that spans structurally and topically diverse data?
answered_by
Central research question driving dataset design and experimental approach

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLaMA-2-70B and 13B probes generalize better across datasets than 7B probes across all training sets and probe types · finding · 0.897
Larger models linearly represent more general concepts including truth
MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · finding · 0.835
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
Llama-3.3-70B shows multi-attempt rate of 7.4% vs. ≤1.2% for all other models tested · finding · 0.808
Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
MM probes trained on larger_than+smaller_than achieve lower NIE than those trained on cities+neg_cities despite higher classification accuracy on sp_en_trans · finding · 0.808
Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.804
Shows behavioral pattern of self-correction is trainable in smaller models
Mass-mean probes generalize about as well as LR and CCS for LLaMA-2-13B and 70B · finding · 0.800
Despite being simpler and optimization-free, MM probes match accuracy of other techniques at scale
Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases · finding · 0.800
Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with even split between human and nonhuman portrayals · finding · 0.796
Model-specific difference in persona susceptibility