finding

active

On neg_cities, a statement's truth value and its LLaMA-2-70B log probability correlate at r = -0.63; on neg_sp_en_trans, they correlate at r = -0.89

Demonstrates strong anti-correlation between text probability and truth in negated datasets
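
As a minimal sketch of how such a figure could be computed, assuming r denotes the Pearson coefficient between 0/1 truth labels and per-statement log probabilities: the function name and the toy numbers below are hypothetical and are not taken from the paper.

```python
import numpy as np

def truth_logprob_correlation(labels, log_probs):
    """Pearson r between binary truth labels (1 = true, 0 = false) and log probabilities."""
    labels = np.asarray(labels, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)
    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
    return float(np.corrcoef(labels, log_probs)[0, 1])

# Hypothetical toy values (NOT the paper's data): the false statements
# are assigned higher log probability than the true ones.
toy_labels = [1, 0, 1, 0, 1, 0]
toy_log_probs = [-12.0, -8.0, -11.0, -9.5, -13.0, -7.5]

r = truth_logprob_correlation(toy_labels, toy_log_probs)
print(f"r = {r:.2f}")  # negative: truth and log probability are anti-correlated
```

A negative r of this kind is what the finding reports for the negated datasets: the model tends to assign higher probability to the false statements.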

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Claims (2)

claim

LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on the likely dataset performing poorly on datasets where truth and text probability are anti-correlated
supports
Establishes that the observed linear structure is not merely a representation of text probability
Training probes on statements and their opposites improves generalization by mitigating non-truth features with opposite-sign correlations
supports
Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
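
The cancellation argument in the second claim can be illustrated with a toy sketch. It assumes a difference-of-means probe direction (assumed here to be what the "MM probe" mentioned elsewhere on this page computes) and hypothetical 2-D "activations" in which axis 0 encodes truth and axis 1 encodes a non-truth feature whose correlation with truth flips sign between a dataset and its opposite. None of the numbers come from the paper.

```python
import numpy as np

def mass_mean_direction(acts, labels):
    """Difference of class means: mean(true activations) - mean(false activations)."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    return acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)

# Hypothetical toy activations: [truth feature, non-truth feature].
# In the original statements the non-truth feature agrees with truth ...
stmt_acts = [[1.0, 1.0], [-1.0, -1.0]]
stmt_labels = [1, 0]
# ... and in the opposite statements it disagrees with truth.
opp_acts = [[1.0, -1.0], [-1.0, 1.0]]
opp_labels = [1, 0]

# Training on one dataset alone mixes the non-truth feature into the direction.
dir_single = mass_mean_direction(stmt_acts, stmt_labels)

# Training on both datasets cancels the opposite-sign component.
dir_combined = mass_mean_direction(stmt_acts + opp_acts, stmt_labels + opp_labels)

print(dir_single, dir_combined)
```

In this toy setup the single-dataset direction has a nonzero component along the non-truth axis, while the combined direction lies entirely along the truth axis.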

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

In early layers, LLaMA-2-13B represents a 'close association' feature that correlates with truth on cities but anti-correlates on neg_cities · claim · 0.820
Hypothesized intermediate feature explaining antipodal alignment between cities and neg_cities in early-middle layers
In LLaMA-2-13B, cities and neg_cities show antipodal alignment in early layers, rotate to orthogonal in middle layers, then eventually align in later layers · finding · 0.775
Layer-by-layer evolution of truth direction alignment, supporting hierarchical abstraction hypothesis
MM probe trained on the likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · finding · 0.768
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family · finding · 0.760
Establishes generalizability of the core difficulty-boundary finding across model families.
Logit self-report drift is positive for all three LLaMA sizes (turn slopes 0.159, 0.038, 0.141; all p<10⁻²⁰) but does not increase monotonically with scale · finding · 0.758
Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
Correlation between layer-wise scores and task accuracy ρ = −0.73 (p < 0.001) on LLaMA · finding · 0.758
Core E3 finding validating S as a predictor of anchoring effectiveness
In LLaMA-2-13B, cities and neg_cities show approximately orthogonal axes of separation in PCA visualizations at intermediate layers · finding · 0.756
Case of misalignment showing that the truth direction is not always shared between a dataset and its negation in smaller models
Deception feature amplification yields only 0.16 ± 0.05 consciousness affirmation rate in LLaMA 3.3 70B under self-referential processing · finding · 0.749
Experiment 2 aggregate amplification result showing amplifying deception features strongly suppresses consciousness claims