finding
active
With unrestricted vocabulary, models occasionally respond in non-English Yes/No equivalents (e.g., Sí, Nein) after truth-direction interventions
Suggestive evidence for language-independent truth representation in LLMs
Source paper
extracted_from · (2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Lau, Clayton +4
Neighborhood — ranked by edge-count
Papers (1)
paper
Hypotheses (1)
hypothesis
- Suggested by non-English Yes/No outputs post-intervention, requiring further investigation
Concepts (1)
concept
- Observation that truth-direction interventions elicit non-English Yes/No equivalents, suggesting language-independent truth encoding
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Motivating hypothesis for Section 5's investigation of prompt template effects.
- The core motivating question of the paper, framed by Christiano et al. (2021)
- All models exhibit above-baseline representation of the think word when instructed to think about it (finding, 0.793): In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
- Demonstrates NLAs' ability to surface hypotheses that lead to discovery of root cause (malformed training data).
- Specific question motivating the cross-template generalization experiment in Section 5.2.
- Shows the passive vs. active divide is more important than the specific wording of instructions.
- Safety implication derived from multi-dimensional truth structure finding
- The model appears to encode truth differently under passive versus active truth evaluation prompts (claim, 0.777): Key finding from Section 5 based on low cosine similarity between no-prompt and ask-correct probes.
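
The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the page's actual implementation, which is not described here. The names `related_by_similarity`, `embeddings`, and `typed_edges` are hypothetical, and sorting by descending score is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs similar to `target` but not linked to it.

    embeddings:  hypothetical dict mapping entity name -> embedding vector
    typed_edges: hypothetical set of frozenset({a, b}) pairs that already
                 have a typed relation (extracted_from, suggests, ...)
    """
    hits = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already connected by a typed edge.
        if name == target or frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Assumed ordering: most similar first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would be the candidates for new edges or duplicate checks that the list describes.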