finding

active

finding:mass-mean-probes-generalize-about-as-well-as-lr-and-ccs-for-llama-2-13b-and-70b

Mass-mean probes generalize about as well as LR and CCS for LLaMA-2-13B and 70B

Despite being simpler and optimization-free, MM probes match the accuracy of other probing techniques at scale
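To make "optimization-free" concrete, here is a minimal sketch of a difference-in-means (mass-mean) probe. It is an illustration under stated assumptions, not the paper's implementation: the function names `fit_mass_mean_probe` and `predict`, the midpoint threshold, and the synthetic Gaussian "activations" are all hypothetical choices made for this example, and any covariance correction the paper may apply is omitted.

```python
import numpy as np


def fit_mass_mean_probe(acts, labels):
    """Fit a difference-in-means probe: no gradient steps, just two class means.

    acts:   (n_samples, dim) array of activations (hypothetical input)
    labels: (n_samples,) boolean array, True for true statements
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    mu_true = acts[labels].mean(axis=0)
    mu_false = acts[~labels].mean(axis=0)
    direction = mu_true - mu_false          # the probe direction
    midpoint = 0.5 * (mu_true + mu_false)   # assumed decision threshold
    return direction, midpoint


def predict(acts, direction, midpoint):
    """Classify as True when the centered activation projects positively."""
    return (np.asarray(acts, dtype=float) - midpoint) @ direction > 0


# Synthetic stand-in data (assumption): two Gaussian clusters that differ
# only along the first coordinate. Real inputs would be model activations.
rng = np.random.default_rng(0)
dim = 16
offset = np.zeros(dim)
offset[0] = 4.0
false_acts = rng.normal(size=(200, dim))
true_acts = rng.normal(size=(200, dim)) + offset
acts = np.vstack([false_acts, true_acts])
labels = np.array([False] * 200 + [True] * 200)

direction, midpoint = fit_mass_mean_probe(acts, labels)
accuracy = (predict(acts, direction, midpoint) == labels).mean()
print(f"train accuracy on synthetic data: {accuracy:.3f}")
```

The only "training" is computing two means, which is the contrast with LR and CCS, both of which fit their parameters by optimization.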

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Claims (1)

claim

Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs
supports
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLaMA-2-70B and 13B probes generalize better across datasets than 7B probes across all training sets and probe types · finding · 0.837
Larger models linearly represent more general concepts including truth
For LLaMA-2-70B, probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans regardless of probing technique · finding · 0.800
Striking cross-domain generalization result supporting the claim that larger models represent abstract truth
Llama-3.3-70B shows multi-attempt rate of 7.4% vs. ≤1.2% for all other models tested · finding · 0.789
Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
LLaMA-2-7B representations of larger_than+smaller_than cluster by surface-level characteristics such as presence of token 'eighty' · finding · 0.782
Demonstrates that small models represent surface features rather than abstract truth
Mass-mean probe directions outperform LR and CCS in causal intervention experiments (NIE) in 7/8 experimental conditions · finding · 0.776
Core result showing MM is superior to LR for causal implication despite similar classification accuracy
All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models · finding · 0.776
Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with an even split between human and nonhuman portrayals · finding · 0.775
Model-specific difference in persona susceptibility
Llama-3.3-70B corrected response scores 75/100 rather than 100 due to residual steering effects (Snell's law reference) · finding · 0.774
Illustrative finding that ESR mitigates but does not fully eliminate steering influence