finding

active

finding:final-token-position-consistently-yields-the-strongest-truth-interventions-across-models

Final token position consistently yields the strongest truth interventions across models

Experiment 1 finding on token position, consistent with prior work

Source paper

extracted_from

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

(2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Clayton Lau +4

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief · quote · 0.762
Core definitional quote for performative chain-of-thought
Concept cone truth interventions would generalize to larger frontier models and multimodal settings · hypothesis · 0.751
Key robustness question raised as future work
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023) · concept · 0.747
Safety intervention that relies on activation modification, which ESR might undermine
The model appears to encode truth differently under passive versus active truth evaluation prompts. · claim · 0.744
Key finding from Section 5 based on low cosine similarity between no-prompt and ask-correct probes.
Universality claims for truth directions are more limited than previously assumed, with significant differences observable for various model layers, task difficulties, task types, and prompt templates. · claim · 0.742
Overarching conclusion summarizing the paper's contribution relative to prior universality claims.
A small group of hidden states (group b) over end-of-sentence punctuation tokens is highly causally implicated in truth judgments · finding · 0.736
Patching experiments localize truth representations to these specific hidden states in LLaMA-2 models
Reasoning models generate performative CoT tokens after achieving strong confidence in their final answer without revealing this belief in text · claim · 0.735
The central empirical claim of the paper, supported by activation probing evidence
No single layer is universally optimal for probing truth directions; different tasks peak at different layers. · claim · 0.734
Argues against the single-layer analysis approach of prior work.