claim

active

claim:the-generalization-improvement-from-explicit-instructions-observed-in-llama-models-a1-a3-to-f0-f2-is-more-pronounced-for-f3-f5-to-f0-f2-in-gemma-models

Explicit instructions improve generalization from A1-A3 to F0-F2 in Llama models; in Gemma models, the improvement is more pronounced for generalization from F3-F5 to F0-F2.

Shows that the instruction effect, although it shifts geometry, may not produce consistent generalization effects across model families.

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Claims (1)

claim

Using the ask-correct prompt improves cross-task generalization of arithmetic probes to factual tasks F0-F2.
extends
Finding that explicit correctness framing partially aligns truth directions across task families.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family. · finding · 0.860
Establishes generalizability of the core difficulty-boundary finding across model families.
Within-family factual generalization (F0-F2) is consistently strong across all models and prompt settings. · finding · 0.798
Establishes a reliable baseline for factual truth direction universality within simple factual recall.
Factual tasks F0-F3 reach near-perfect AUROC in early-to-mid layers of Llama-3.1-8B; arithmetic tasks A1-A3 emerge much later; counting tasks F4-F5 also emerge late, similar to arithmetic. · finding · 0.798
Core empirical finding about layer-dependent truth direction emergence across task types.
Gemma 2: Improving Open Language Models at a Practical Size (Team et al., 2024) · concept · 0.785
Paper describing Gemma 2 model family used in this study
Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.780
Shows behavioral pattern of self-correction is trainable in smaller models
Results may not fully generalize to all models and scenarios because the model organism relies on hints and nudges, and Llama Nemotron cannot consistently distinguish evaluation from deployment based on subtle cues · claim · 0.777
Key limitation acknowledged by authors.
Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models · finding · 0.777
Experiment 1 finding localizing where truth can be causally mediated
LLaMA-2-70B and 13B probes generalize better across datasets than 7B probes across all training sets and probe types · finding · 0.776
Larger models linearly represent more general concepts including truth