Kevin Shengyang Yu

ORCID: 0000-0003-0496-2365 | OpenAlex: A5063752356 | name_hash: 926a1725710f4ad8a57df8cc…

Authored papers (1)

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs (2025)
Propositional truth in LLMs is not encoded as a single linear direction but as a multidimensional subspace that can be characterized by concept cones. A concept cone is the set of all nonnegative linear combinations of a set of orthonormal basis vectors, where each basis vector independently and causally mediates true/false behavior. The paper applies the gradient-based concept cone framework (introduced by Wollschläger et al. 2025 for refusal) to truth and runs experiments on Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Gemma-2-2B, and Gemma-2-9B. Qwen2.5-7B and Gemma-2-9B sustain a near-100% Answer Switching Rate (ASR) at every tested cone dimensionality from 1 to 5, confirming at least a 5-dimensional truth-mediating subspace in those models. Directional ablation with the discovered cone vectors on 200 Alpaca prompts yields mean KL divergences of only 0.026–0.045 across models, confirming surgical specificity. Cosine similarities between the classic difference-in-means (DIM) truth vector and every cone basis vector beyond the first are on the order of 10⁻⁹, establishing that the additional axes are genuinely orthogonal to DIM rather than refinements of it. Truth-related directions reliably emerge at 60% to 75% of normalized layer depth and peak at the final token position. These findings imply that models may be more vulnerable to adversarial manipulation of truthfulness than single-direction accounts suggest: multiple independently steerable dimensions of factual behavior exist and can be exploited without disturbing the primary direction that standard probing detects.
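The quantities named in this summary (the difference-in-means vector, directional ablation, a vector drawn from a cone, and cosine similarity) can be illustrated with a small sketch. This is not the paper's gradient-based cone-finding procedure. It only spells out the definitions on made-up 4-dimensional "activations", and the helper names, the toy numbers, and the axis-aligned basis are all assumptions chosen for illustration.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def normalize(u):
    n = norm(u)
    return [a / n for a in u]

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def difference_in_means(true_acts, false_acts):
    """Mean activation on true statements minus mean on false statements."""
    dim = len(true_acts[0])
    mean_t = [sum(x[i] for x in true_acts) / len(true_acts) for i in range(dim)]
    mean_f = [sum(x[i] for x in false_acts) / len(false_acts) for i in range(dim)]
    return [a - b for a, b in zip(mean_t, mean_f)]

def ablate(x, direction):
    """Directional ablation: remove the component of x along `direction`."""
    r = normalize(direction)
    coeff = dot(x, r)
    return [a - coeff * b for a, b in zip(x, r)]

def sample_from_cone(basis, rng):
    """Unit vector formed as a nonnegative combination of the basis vectors."""
    weights = [rng.random() for _ in basis]  # weights lie in [0, 1)
    dim = len(basis[0])
    combo = [sum(w * b[i] for w, b in zip(weights, basis)) for i in range(dim)]
    return normalize(combo)

# Toy activations (invented numbers, not model outputs).
true_acts = [[1.0, 0.2, 0.0, 0.5], [0.8, 0.1, 0.1, 0.4]]
false_acts = [[-1.0, 0.2, 0.0, 0.5], [-0.8, 0.1, 0.1, 0.4]]
dim_vec = difference_in_means(true_acts, false_acts)

# Hypothetical orthonormal cone basis: the first axis aligns with the DIM
# vector, the second is orthogonal to it.
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

rng = random.Random(0)
v = sample_from_cone(basis, rng)
ablated = ablate(true_acts[0], v)

print("cosine(DIM, basis[0]) =", cosine(dim_vec, basis[0]))
print("cosine(DIM, basis[1]) =", cosine(dim_vec, basis[1]))
print("component along v after ablation =", dot(ablated, v))
```

In this toy setup the second basis vector has zero cosine similarity with the DIM vector by construction, which mirrors the near-zero similarities reported above without reproducing them. Whether ablating a given direction actually flips a model's answers is an empirical question that a sketch like this cannot address.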

Co-authors (12)

Cole Blondin (6 shared)
Kevin Zhu (6 shared)
Oscar Yasunaga (6 shared)
Sean O’Brien (6 shared)
Vaidehi Bulusu (6 shared)
Vasu Sharma (6 shared)
Lau, Clayton (4 shared)
Amos Azaria (2 shared)
Arditi et al. (2 shared)
Clayton Lau (2 shared)
Max Tegmark (2 shared)
Samuel Marks (2 shared)