Kevin Shengyang Yu
Authored papers (1)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs (2025)
Propositional truth in LLMs is encoded not as a single linear direction but as a multidimensional subspace that can be characterized by concept cones: sets of all nonnegative linear combinations of orthonormal basis vectors, each of which independently causally mediates true/false behavior. The paper applies the gradient-based concept cone framework (introduced by Wollschläger et al. 2025 for refusal) to truth. Experiments across Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Gemma-2-2B, and Gemma-2-9B show that Qwen2.5-7B and Gemma-2-9B sustain a near-100% Answer Switching Rate (ASR) at every tested cone dimensionality from 1 to 5, indicating a truth-mediating subspace of at least 5 dimensions in those models. Directional ablation using the discovered cone vectors on 200 Alpaca prompts yields mean KL divergences of only 0.026–0.045 across models, indicating that the intervention is specific to truth and leaves unrelated behavior largely unchanged. Cosine similarities between the classic difference-in-means (DIM) truth vector and all cone basis vectors beyond the first are on the order of 10⁻⁹, so the additional axes are genuinely orthogonal to DIM rather than refinements of it. Truth-related directions reliably emerge at 60–75% of normalized layer depth and peak at the final token position. These findings imply that models may be more vulnerable to adversarial manipulation of truthfulness than single-direction accounts suggest, because multiple independently steerable dimensions of factual behavior exist and can be exploited without disturbing the primary direction detectable by standard probing.
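The geometric objects named in the summary (a difference-in-means direction, a cone of nonnegative combinations of orthonormal basis vectors, and directional ablation) can be illustrated with a minimal NumPy sketch. This is not the paper's code: the function names are hypothetical, the activations are random placeholders, and the basis here comes from orthonormalizing random vectors, whereas the paper finds its basis by gradient-based optimization.

```python
import numpy as np

def dim_direction(true_acts, false_acts):
    """Unit difference-in-means (DIM) vector between two activation sets."""
    d = true_acts.mean(axis=0) - false_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def orthonormalize(vectors):
    """Orthonormal basis (as rows) spanning the given row vectors."""
    q, r = np.linalg.qr(vectors.T)      # columns of q are orthonormal
    q = q * np.sign(np.diag(r))         # keep each vector's original orientation
    return q.T

def sample_cone(basis, rng):
    """Unit vector from the cone: a nonnegative combination of the basis rows."""
    coeffs = rng.random(basis.shape[0])  # nonnegative coefficients in [0, 1)
    v = coeffs @ basis
    return v / np.linalg.norm(v)

def ablate(acts, direction):
    """Directional ablation: remove each activation's component along a unit direction."""
    return acts - np.outer(acts @ direction, direction)

# Placeholder data: hidden size 64, cone dimensionality 5 (both illustrative).
rng = np.random.default_rng(0)
basis = orthonormalize(rng.normal(size=(5, 64)))
direction = sample_cone(basis, rng)
acts = rng.normal(size=(10, 64))
ablated = ablate(acts, direction)
```

After ablation, every row of `ablated` has zero projection onto `direction`, which is the property the ablation experiments rely on. Any vector drawn by `sample_cone` has nonnegative coordinates in the basis, which is what makes the set a cone rather than a full subspace.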
Co-authors (12)
- Cole Blondin (6 shared)
- Kevin Zhu (6 shared)
- Oscar Yasunaga (6 shared)
- Sean O’Brien (6 shared)
- Vaidehi Bulusu (6 shared)
- Vasu Sharma (6 shared)
- Lau, Clayton (4 shared)
- Amos Azaria (2 shared)
- Arditi et al. (2 shared)
- Clayton Lau (2 shared)
- Max Tegmark (2 shared)
- Samuel Marks (2 shared)