finding

active

finding:with-unrestricted-vocabulary-models-occasionally-respond-in-non-english-yes-no-equivalents-e-g-si-nein-after-truth-direction-interventions

With unrestricted vocabulary, models occasionally respond in non-English Yes/No equivalents (e.g., Sí, Nein) after truth-direction interventions

Suggestive evidence for language-independent truth representation in LLMs

Source paper

extracted_from

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

(2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Clayton Lau +4

Neighborhood — ranked by edge-count

Papers (1)

paper

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
introduces

Hypotheses (1)

hypothesis

The underlying truth representation may generalize across lexical choices and languages
supports
Suggested by the non-English Yes/No outputs observed after intervention; requires further investigation

Concepts (1)

concept

Cross-Lingual Truth Representation
supports
Observation that truth-direction interventions elicit non-English Yes/No equivalents, suggesting language-independent truth encoding

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions. · hypothesis · 0.804
Motivating hypothesis for Section 5's investigation of prompt template effects.
Given a language model M and a statement s, does M believe s to be true? · question · 0.793
The core motivating question of the paper, framed by Christiano et al. (2021)
All models exhibit above-baseline representation of the think word when instructed to think about it · finding · 0.793
In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
Opus 4.6 spontaneously responded in Russian to an English prompt; NLA explanations revealed the model was fixated on the hypothesis that the user was a non-native English speaker. · finding · 0.791
Demonstrates NLAs' ability to surface hypotheses that lead to discovery of root cause (malformed training data).
Will the no-prompt truth directions generalize to ask-correct activations? · question · 0.791
Specific question motivating the cross-template generalization experiment in Section 5.2.
Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity. · finding · 0.786
Shows the passive vs. active divide is more important than the specific wording of instructions.
Multiple semantically adjacent truth directions make models more vulnerable to manipulations that shift outputs without obvious signs in the primary truth direction · claim · 0.783
Safety implication derived from multi-dimensional truth structure finding
The model appears to encode truth differently under passive versus active truth evaluation prompts. · claim · 0.777
Key finding from Section 5 based on low cosine similarity between no-prompt and ask-correct probes.