finding

active

finding:in-llama-2-13b-cities-and-neg-cities-show-approximately-orthogonal-axes-of-separation-in-pca-visualizations-at-intermediate-layers

In LLaMA-2-13B, cities and neg_cities show approximately orthogonal axes of separation in PCA visualizations at intermediate layers

A case of misalignment: in smaller models, a dataset and its negation do not always share a truth direction
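One way to make "approximately orthogonal axes of separation" concrete is to compare the true-vs-false separation direction of each dataset and check that their cosine similarity is near zero. The sketch below is illustrative only: it uses a difference-of-means direction rather than the PCA visualizations the finding is based on, and the activations are synthetic stand-ins, not LLaMA-2-13B residual-stream data. The helper names `separation_axis` and `axis_cosine` are hypothetical.

```python
import numpy as np

def separation_axis(acts, labels):
    """Unit vector pointing from the mean false activation to the mean true one."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    diff = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return diff / np.linalg.norm(diff)

def axis_cosine(acts_a, labels_a, acts_b, labels_b):
    """Cosine between two datasets' separation axes.

    Roughly +1: shared direction, -1: antipodal, 0: orthogonal.
    """
    return float(separation_axis(acts_a, labels_a) @ separation_axis(acts_b, labels_b))

# Synthetic stand-in data (an assumption for illustration, not real model activations).
rng = np.random.default_rng(0)
dim, n = 64, 200
labels = np.arange(n) % 2 == 0                  # alternate true / false statements
signs = np.where(labels, 1.0, -1.0)[:, None]    # +1 for true, -1 for false

e0 = np.zeros(dim); e0[0] = 1.0                 # separation axis of the "cities" stand-in
e1 = np.zeros(dim); e1[1] = 1.0                 # a different, orthogonal axis for "neg_cities"

cities = signs * 3.0 * e0 + 0.1 * rng.normal(size=(n, dim))
neg_cities = signs * 3.0 * e1 + 0.1 * rng.normal(size=(n, dim))

cos_orthogonal = axis_cosine(cities, labels, neg_cities, labels)
cos_antipodal = axis_cosine(cities, labels, -cities, labels)
print(f"cities vs neg_cities stand-in: {cos_orthogonal:+.3f}")
print(f"cities vs sign-flipped copy:   {cos_antipodal:+.3f}")
```

On real data, `acts` would be the per-statement activations at a chosen intermediate layer; which layer and token position to use is left open here.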

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Claims (2)

claim

As LLMs scale, they develop increasingly general abstractions, with large models linearly representing abstract concepts like truth that capture shared properties of diverse inputs
supports
Interpretive claim connecting scale to abstraction level in LLM representations
Antipodal alignment between related datasets (e.g., larger_than and smaller_than) in smaller models resolves to common-direction alignment in larger models
supports
Scale-dependent structural finding from PCA visualizations in §4

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

In LLaMA-2-13B, cities and neg_cities show antipodal alignment in early layers, rotate to orthogonal in middle layers, then eventually align in later layers · finding · 0.867
Layer-by-layer evolution of truth direction alignment, supporting hierarchical abstraction hypothesis
PCA visualizations of LLaMA-2-13B and 70B representations of curated datasets show clear linear structure, with true statements separating from false ones in the top two principal components · finding · 0.805
Primary visual evidence for linear truth representations in large LLMs
In LLaMA-2-13B, larger_than and smaller_than separate along antipodal directions in PCA; in LLaMA-2-70B they align along a common direction · finding · 0.798
Scale-dependent alignment result demonstrating how more abstract truth representations emerge with scale
In early layers, LLaMA-2-13B represents a 'close association' feature that correlates with truth on cities but anti-correlates on neg_cities · claim · 0.794
Hypothesized intermediate feature explaining antipodal alignment between cities and neg_cities in early-middle layers
Linear structure in LLaMA-2-13B representations emerges rapidly in early-middle layers, later for conjunctive statements · finding · 0.760
Layer-wise PCA analysis shows hierarchical development of truth representations across forward pass
In LLaMA-2-13B, salient linear structure in the top PCs rapidly emerges in early-middle layers, with this emergence occurring later for conjunctive statements than simple statements · finding · 0.760
Layer-wise emergence pattern supporting hierarchical development hypothesis
For neg_cities, truth value and LLaMA-2-70B log probability correlate at r=-0.63; for neg_sp_en_trans at r=-0.89 · finding · 0.756
Demonstrates strong anti-correlation between text probability and truth in negated datasets
The representation-based path and the behavior-based path in Llama-3.1 8B activation space trace out similar curves, demonstrating bidirectional geometry alignment · finding · 0.740
Key empirical result showing that optimizing for behavioral outputs and fitting representation geometry produce the same path in activation space.