finding

active

finding:llama-2-70b-and-13b-probes-generalize-better-across-datasets-than-7b-probes-across-all-training-sets-and-probe-types

LLaMA-2-70B and 13B probes generalize better across datasets than 7B probes across all training sets and probe types

Larger models linearly represent more general concepts including truth

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Claims (1)

claim

As LLMs scale, they develop increasingly general abstractions, with large models linearly representing abstract concepts like truth that capture shared properties of diverse inputs
associated_with · supports
Interpretive claim connecting scale to abstraction level in LLM representations

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

For LLaMA-2-70B, probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans regardless of probing technique · finding · 0.897
Striking cross-domain generalization result supporting the claim that larger models represent abstract truth
MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · finding · 0.843
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
Mass-mean probes generalize about as well as LR and CCS for LLaMA-2-13B and 70B · finding · 0.837
Despite being simpler and optimization-free, MM probes match accuracy of other techniques at scale
Llama-3.3-70B shows multi-attempt rate of 7.4% vs. ≤1.2% for all other models tested · finding · 0.833
Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with even split between human and nonhuman portrayals · finding · 0.831
Model-specific difference in persona susceptibility
Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.812
Central interpretive claim of the paper supported by causal ablation and activation evidence
Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.804
Shows behavioral pattern of self-correction is trainable in smaller models
26 candidate off-topic detector latents identified in Llama-3.3-70B via contrastive search · finding · 0.803
Core mechanistic finding identifying specific SAE latents associated with ESR