claim
active
claim:discovered-truth-directions-are-highly-specific-and-do-not-interfere-with-general-instruction-following-behavior
Discovered truth directions are highly specific and do not interfere with general instruction-following behavior
Interpretation of KL divergence retention results
Source paper
extracted_from
(2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Clayton Lau +4
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (1)
finding
- Experiment 3 result showing minimal behavioral drift from truth intervention in Qwen 14B
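The "behavioral drift" in this finding can be illustrated with a small KL divergence computation. The sketch below is not the paper's evaluation code: it assumes drift is measured as the mean KL divergence between the next-token distributions of the unmodified model and the intervened model, which this entry does not spell out. The function names (`softmax`, `mean_kl`) and the random logits are hypothetical stand-ins for real model outputs.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last (vocabulary) axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl(base_logits, steered_logits, eps=1e-12):
    # Mean KL(p_base || p_steered) across positions.
    # Hypothetical helper: a low value means the intervention barely
    # changes the output distribution.
    p = softmax(base_logits)
    q = softmax(steered_logits)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())

# Synthetic logits standing in for model outputs (4 positions, 10 tokens).
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 10))
small_shift = base + 0.01 * rng.normal(size=base.shape)  # mild intervention
large_shift = base + 2.0 * rng.normal(size=base.shape)   # disruptive intervention

kl_same = mean_kl(base, base)
kl_small = mean_kl(base, small_shift)
kl_large = mean_kl(base, large_shift)
print(kl_same, kl_small, kl_large)
```

Under this assumption, a truth-direction intervention that leaves instruction-following intact would behave like `small_shift`: its KL divergence from the baseline stays close to zero, while a disruptive edit produces a much larger value.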
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Open question on generalization beyond Gemma and Qwen families
- Specific question motivating the cross-template generalization experiment in Section 5.2.
- Research question motivating Section 5.
- Truth directions emerge in earlier layers for factual tasks and later layers for arithmetic tasks. · claim · 0.790 · Core empirical claim about the layer-dependence of truth direction emergence as a function of task type.
- Motivating hypothesis for Section 5's investigation of prompt template effects.
- Does instructing the model to assess correctness affect the geometry of truth directions? · question · 0.785 · One of the three guiding research questions of the paper.
- A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements
- Where inside the LLM should we look for an accurate truth direction that will generalize the most across tasks? · question · 0.781 · One of the three guiding research questions of the paper.