finding

active

finding:truth-related-directions-reliably-emerge-at-60-75-of-normalized-layer-depth-in-qwen-and-gemma-models

Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models

Finding from Experiment 1, which localizes the layers at which truth can be causally mediated
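
To make the 60–75% band concrete, the following minimal sketch converts it to layer indices for a given model depth. It assumes that the normalized depth of 0-indexed layer `i` is `(i + 1) / n_layers`; the source paper may define it differently. The function name `depth_band_layers` and the layer counts in the usage lines are hypothetical, chosen only for illustration.

```python
def depth_band_layers(n_layers, lo_pct=60, hi_pct=75):
    """Return 0-indexed layers whose normalized depth lies in [lo_pct, hi_pct] percent.

    Assumption (not taken from the source paper): normalized depth of
    layer i is (i + 1) / n_layers. Integer arithmetic is used so the band
    edges are not affected by floating-point rounding.
    """
    return [
        i
        for i in range(n_layers)
        if lo_pct * n_layers <= 100 * (i + 1) <= hi_pct * n_layers
    ]


# Hypothetical layer counts, for illustration only.
band_40 = depth_band_layers(40)
band_26 = depth_band_layers(26)
print(band_40)
print(band_26)
```

Under this definition, a shallower model has a narrower band in absolute layers, which is why the finding is stated in normalized rather than absolute depth.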

Source paper

extracted_from

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

(2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Clayton Lau +4

Neighborhood — ranked by edge-count

Papers (1)

paper

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
introduces

Claims (1)

claim

Truthful behavior in LLMs is not confined to a single linear axis; multiple orthogonal directions can independently mediate it
supports
Central interpretive claim of the paper

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
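
The selection rule described here (cosine similarity of at least 0.65, excluding entities that already share a typed edge) can be sketched as below. This is an illustration under stated assumptions, not the site's actual pipeline: the embedding vectors, the names `cosine` and `related_by_similarity`, and the toy data are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def related_by_similarity(query_vec, candidates, typed_neighbors, threshold=0.65):
    """Return (name, score) pairs at or above the threshold, best first.

    candidates: dict mapping entity name -> embedding vector (hypothetical).
    typed_neighbors: set of entity names that already have a typed edge
    to the query entity and are therefore excluded.
    """
    scored = [
        (name, cosine(query_vec, vec))
        for name, vec in candidates.items()
        if name not in typed_neighbors
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy data: "d" is similar but already linked by a typed edge, so it is skipped.
candidates = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [2.0, 0.0]}
result = related_by_similarity([1.0, 0.0], candidates, typed_neighbors={"d"})
print([name for name, _ in result])
```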

The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family.
finding · 0.851
Establishes generalizability of the core difficulty-boundary finding across model families.
Truth directions emerge in earlier layers for factual tasks and later layers for arithmetic tasks.
claim · 0.828
Core empirical claim about the layer-dependence of truth direction emergence as a function of task type.
No single layer is universally optimal for probing truth directions; different tasks peak at different layers.
claim · 0.815
Argues against the single-layer analysis approach of prior work.
Reflection-inducing directions emerge more clearly in higher layers (ℓ > 5) for both models and datasets.
finding · 0.814
Empirical observation about which network layers encode reflection-relevant information.
The model converges to a more stable truth direction in middle-to-late layers, as evidenced by increasing cosine similarity between layer-wise probes.
claim · 0.802
Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
Universality claims for truth directions are more limited than previously assumed, with significant differences observable for various model layers, task difficulties, task types, and prompt templates.
claim · 0.795
Overarching conclusion summarizing the paper's contribution relative to prior universality claims.
In early layers, the polarity-dependent direction tP explains ~0.38 of truth-related variance at layer 7 vs ~0.09 for tG; by middle layers tG takes over and tP decays.
finding · 0.792
Variance decomposition showing the disentanglement of polarity from truth across model depth.
Linear truth directions in LLMs are reliable primarily in factual recall cases and break down when truth assessment depends on computing and storing intermediate results.
claim · 0.789
Central empirical conclusion of the paper about the fundamental limits of truth directions.