finding
active
The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family.
Establishes generalizability of the core difficulty-boundary finding across model families.
Source paper
extracted_from · (2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central empirical conclusion of the paper about the fundamental limits of truth directions.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows the instruction effect, while shifting geometry, may not produce consistent generalization effects across model families.
- Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models · finding · 0.851 · Experiment 1 finding localizing where truth can be causally mediated
- Replication across open-weight models supports scale-emergence finding
- Shows behavioral pattern of self-correction is trainable in smaller models
- Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.811 · Central interpretive claim of the paper supported by causal ablation and activation evidence
- Establishes task difficulty as a hard limit that instructions cannot overcome.
- Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
- Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates