finding
active
finding:final-token-position-consistently-yields-the-strongest-truth-interventions-across-models
Final token position consistently yields the strongest truth interventions across models
Experiment 1 finding on token position, consistent with prior work
Source paper
extracted_from · (2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Lau, Clayton +4
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Core definitional quote for performative chain-of-thought
- Concept cone truth interventions would generalize to larger frontier models and multimodal settings · hypothesis · 0.751 · Key robustness question raised as future work
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023) · concept · 0.747 · Safety intervention that relies on activation modification, which ESR might undermine
- The model appears to encode truth differently under passive versus active truth evaluation prompts. · claim · 0.744 · Key finding from Section 5 based on low cosine similarity between no-prompt and ask-correct probes.
- Overarching conclusion summarizing the paper's contribution relative to prior universality claims.
- Patching experiments localize truth representations to these specific hidden states in LLaMA-2 models
- The central empirical claim of the paper, supported by activation probing evidence
- Argues against the single-layer analysis approach of prior work.