finding

active

finding:in-llama-2-13b-cities-and-neg-cities-show-antipodal-alignment-in-early-layers-rotate-to-orthogonal-in-middle-layers-then-eventually-align-in-later-layers

In LLaMA-2-13B, cities and neg_cities show antipodal alignment in early layers, rotate to orthogonal in middle layers, then eventually align in later layers

Layer-by-layer evolution of truth-direction alignment between cities and neg_cities, supporting the hierarchical abstraction hypothesis
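The antipodal, orthogonal, and aligned regimes described above can be made concrete by tracking the cosine similarity between the two datasets' truth directions at each layer. The following is a minimal sketch, not the procedure behind this finding: it assumes a difference-of-class-means direction per layer and uses synthetic activations in place of real LLaMA-2-13B residual-stream activations. The names `truth_direction`, `layerwise_alignment`, and `make_synthetic` are hypothetical helpers introduced for illustration.

```python
import numpy as np


def truth_direction(acts, labels):
    """Difference of class means: mean(true) - mean(false).

    acts:   (n_samples, d_model) activations at one layer
    labels: (n_samples,) boolean truth values
    """
    labels = np.asarray(labels, dtype=bool)
    return acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)


def layerwise_alignment(acts_a, labels_a, acts_b, labels_b):
    """Cosine similarity between the two datasets' truth directions per layer.

    acts_a, acts_b: (n_layers, n_samples, d_model) arrays.
    Returns an array of shape (n_layers,): values near -1 are antipodal,
    near 0 orthogonal, near +1 aligned.
    """
    cosines = []
    for layer_a, layer_b in zip(acts_a, acts_b):
        u = truth_direction(layer_a, labels_a)
        v = truth_direction(layer_b, labels_b)
        cosines.append(float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))))
    return np.array(cosines)


def make_synthetic(directions, labels, noise, rng):
    """Assumption: toy activations where true/false samples sit at +/- a
    chosen direction per layer, plus Gaussian noise."""
    signs = np.where(labels, 1.0, -1.0)[:, None]
    return np.stack([
        signs * d + noise * rng.standard_normal((len(labels), d.size))
        for d in directions
    ])


rng = np.random.default_rng(0)
labels = np.arange(200) % 2 == 0          # alternating true/false statements
e0, e1 = np.eye(8)[0], np.eye(8)[1]       # two orthogonal toy axes

# Three toy "layers": the first dataset keeps one truth axis throughout,
# the second starts flipped, rotates to an orthogonal axis, then aligns.
acts_cities = make_synthetic([e0, e0, e0], labels, 0.05, rng)
acts_neg_cities = make_synthetic([-e0, e1, e0], labels, 0.05, rng)

cosines = layerwise_alignment(acts_cities, labels, acts_neg_cities, labels)
print(cosines)  # by construction: roughly antipodal, orthogonal, aligned
```

With real activations, `acts_cities` and `acts_neg_cities` would be replaced by per-layer representations extracted from the model, and the resulting curve would be read against the early, middle, and late regimes named in the finding.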

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Claims (1)

claim

LLMs hierarchically develop understanding of their input data, progressing from surface-level features in early layers to more abstract concepts in later layers
supports
Interpretation of the layer-by-layer PCA visualizations showing linear structure emerging in early-middle layers

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

In LLaMA-2-13B, cities and neg_cities show approximately orthogonal axes of separation in PCA visualizations at intermediate layers · finding · 0.867
Case of misalignment showing that the truth direction is not always shared between a dataset and its negation in smaller models
In early layers, LLaMA-2-13B represents a 'close association' feature that correlates with truth on cities but anti-correlates on neg_cities · claim · 0.836
Hypothesized intermediate feature explaining antipodal alignment between cities and neg_cities in early-middle layers
In LLaMA-2-13B, larger_than and smaller_than separate along antipodal directions in PCA; in LLaMA-2-70B they align along a common direction · finding · 0.805
Scale-dependent alignment result demonstrating how more abstract truth representations emerge with scale
Antipodal alignment between related datasets (e.g., larger_than and smaller_than) in smaller models resolves to common-direction alignment in larger models · claim · 0.786
Scale-dependent structural finding from PCA visualizations in §4
Layer-wise geometry shows early dip, mid-layer alignment, and late standardization across tasks · claim · 0.784
Qualitative pattern from E3.
For neg_cities, truth value and LLaMA-2-70B log probability correlate at r=-0.63; for neg_sp_en_trans at r=-0.89 · finding · 0.775
Demonstrates strong anti-correlation between text probability and truth in negated datasets
The representation-based path and the behavior-based path in Llama-3.1 8B activation space trace out similar curves, demonstrating bidirectional geometry alignment. · finding · 0.773
Key empirical result showing that optimizing for behavioral outputs and fitting representation geometry produce the same path in activation space.
Math and code tasks show strongest mid-layer anchoring on LLaMA (S ≈ −1.65 at layers 8-12) · finding · 0.762
Task-specific E3 finding showing compositional reasoning requires deeper processing