Sean O’Brien (thinker:sean-o-brien)
Authored papers (1)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs (2025)
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones: sets of all nonnegative linear combinations of orthonormal basis vectors, each of which independently causally mediates true/false behavior. The paper applies the gradient-based concept cone framework (introduced by Wollschläger et al. 2025 for refusal) to truth. Experiments across Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Gemma-2-2B, and Gemma-2-9B show that Qwen2.5-7B and Gemma-2-9B sustain near-100% Answer Switching Rate (ASR) at every tested cone dimensionality from 1 to 5, indicating a truth-mediating subspace of at least 5 dimensions in those models. Directional ablation using discovered cone vectors on 200 Alpaca prompts yields mean KL divergences of only 0.026–0.045 across models, indicating that the ablation has little effect on unrelated behavior. Cosine similarities between the classic difference-in-means (DIM) truth vector and all cone basis vectors beyond the first are on the order of 10⁻⁹, so the additional axes are effectively orthogonal to DIM rather than refinements of it. Truth-related directions reliably emerge at 60–75% of normalized layer depth, peaking at the final token position. These findings imply that models may be more vulnerable to adversarial manipulation of truthfulness than single-direction accounts suggest, because multiple independently steerable dimensions of factual behavior exist and can be exploited without disturbing the primary direction detectable by standard probing.
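The geometric objects named in the summary above (a concept cone, the DIM vector, directional ablation, and cosine similarity) can be illustrated with a minimal NumPy sketch. This is not the paper's method: the paper finds cone bases by gradient-based optimization on real model activations, whereas here random vectors stand in for activations and for the basis. All function names, the hidden size of 64, and the sample counts are assumptions chosen for illustration.

```python
import numpy as np

def orthonormal_basis(vectors):
    """Orthonormalize row vectors with a reduced QR decomposition."""
    q, _ = np.linalg.qr(np.asarray(vectors, dtype=float).T)
    return q.T  # rows are orthonormal

def sample_from_cone(basis, rng):
    """Return a unit vector that is a nonnegative combination of the basis rows."""
    coeffs = rng.random(basis.shape[0])  # nonnegative weights in [0, 1)
    v = coeffs @ basis
    return v / np.linalg.norm(v)

def diff_in_means(true_acts, false_acts):
    """Difference-in-means (DIM) direction between two sets of activations."""
    return true_acts.mean(axis=0) - false_acts.mean(axis=0)

def ablate_direction(acts, direction):
    """Directional ablation: remove each activation's component along `direction`."""
    d = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ d, d)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for real data (assumption: hidden size 64, cone dimension 5).
rng = np.random.default_rng(0)
hidden_dim, cone_dim = 64, 5
basis = orthonormal_basis(rng.normal(size=(cone_dim, hidden_dim)))
direction = sample_from_cone(basis, rng)

true_acts = rng.normal(size=(8, hidden_dim))
false_acts = rng.normal(size=(8, hidden_dim))
dim_vector = diff_in_means(true_acts, false_acts)

# After ablation, the activations have no component along the sampled direction.
ablated = ablate_direction(true_acts, direction)
```

In this sketch, `cosine(dim_vector, basis[i])` is the kind of quantity the summary reports as being on the order of 10⁻⁹ for basis vectors beyond the first; with random stand-in data the values carry no meaning beyond showing how the comparison is computed.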
Co-authors (12)
- Cole Blondin (9 shared)
- Kevin Zhu (9 shared)
- Oscar Yasunaga (9 shared)
- Vaidehi Bulusu (9 shared)
- Vasu Sharma (9 shared)
- Kevin Shengyang Yu (6 shared)
- Lau, Clayton (6 shared)
- Amos Azaria (3 shared)
- Arditi et al. (3 shared)
- Clayton Lau (3 shared)
- Max Tegmark (3 shared)
- Samuel Marks (3 shared)
Recent mentions (1)
- papers-typedyu-2025-directions-cones.md