paper:doi-10-48550-arxiv-2505-21800
From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
TL;DR
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of orthonormal basis vectors, each of which independently causally mediates true/false behavior. Applying the gradient-based concept cone framework (introduced by Wollschläger et al. 2025 for refusal) to truth, experiments across Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Gemma-2-2B, and Gemma-2-9B show that Qwen2.5-7B and Gemma-2-9B sustain near-100% Answer Switching Rate (ASR) across all tested cone dimensionalities from 1 to 5, confirming at least a 5-dimensional truth-mediating subspace in those models. Directional ablation using discovered cone vectors on 200 Alpaca prompts yields mean KL divergences of only 0.026–0.045 across models, confirming surgical specificity. Cosine similarities between the classic difference-in-means (DIM) truth vector and all cone basis vectors beyond the first are on the order of 10⁻⁹, establishing that the additional axes are genuinely orthogonal to DIM rather than refinements of it. Truth-related directions reliably emerge between 60–75% of normalized layer depth, peaking at the final token position. These findings imply that models may be more vulnerable to adversarial manipulation of truthfulness than single-direction accounts suggest, because multiple independently steerable dimensions of factual behavior exist and can be exploited without disturbing the primary direction detectable by standard probing.
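The cone definition and the directional ablation mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: plain Python lists stand in for residual-stream activations, the helper names (`sample_cone_direction`, `ablate`) are hypothetical, coefficients are drawn uniformly from [0, 1) as an assumption, and ablation is taken to be the usual projection removal x − (x·d)d for a unit direction d.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def sample_cone_direction(basis, rng):
    """Draw one unit direction from the cone spanned by `basis`.

    Every coefficient is nonnegative, so the result is a nonnegative
    linear combination of the basis vectors (assumed orthonormal).
    """
    coeffs = [rng.random() for _ in basis]  # assumption: uniform in [0, 1)
    dim = len(basis[0])
    combo = [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(dim)]
    return normalize(combo)


def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    proj = dot(activation, direction)
    return [a - proj * d for a, d in zip(activation, direction)]


# Toy usage: a 2-dimensional cone inside a 3-dimensional activation space.
rng = random.Random(0)
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
direction = sample_cone_direction(basis, rng)
ablated = ablate([1.0, 2.0, 3.0], direction)
```

After ablation, the activation has no remaining component along the sampled direction, while the coordinate outside the cone's span (the third one here) is unchanged.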
What to take away
- 1. Qwen2.5-7B and Gemma-2-9B maintain near-100% Answer Switching Rate (ASR) across cone dimensionalities 1 through 5, demonstrating that at least a 5-dimensional concept cone causally mediates propositional truth in those models.
- 2. Truth-mediating directions reliably emerge between 60–75% of normalized layer depth across all tested Qwen2.5 and Gemma-2 variants, peaking at the final token position, consistent with prior findings on high-level decision accumulation.
- 3. The concept cone framework is operationalized with a three-term loss (L_add + L_ablate + L_retain), where L_retain is measured on 30-token continuations of Alpaca instructions to guard against collateral behavioral drift.
- 4. Directional ablation of discovered truth cones on 200 Alpaca prompts yields mean KL divergences of 0.038, 0.045, 0.026, and 0.031 for Qwen2.5-14B, Gemma-2-2B, Qwen2.5-7B, and Gemma-2-9B respectively, indicating minimal interference with general instruction-following.
- 5. Cosine similarities between the difference-in-means (DIM) truth vector and cone basis vectors v2 through v5 in Gemma-2-9B are on the order of 10⁻⁹, confirming these axes encode orthogonal structure absent from the classical linear direction.
- 6. Smaller models show non-monotonic ASR with increasing cone dimensionality: Gemma-2-2B drops to 53.7% at dim-3 and 27.1% at dim-5, while Qwen2.5-3B drops to 45.1% at dim-2 before partially recovering, suggesting representational capacity limits truth subspace dimensionality.
- 7. The methodology for cone discovery uses a gradient-based optimization over an orthonormal basis with binary cross-entropy targets (restricting output logits to 'Yes'/'No' tokens) and Monte Carlo sampling of 64 random nonnegative-coefficient directions per cone for evaluation.
- 8. Applying the same concept cone framework to sentiment (Stanford Sentiment Treebank) and toxicity (ToxiGen, 274,000 phrases) failed to yield valid cones, suggesting the method's success on truth is not trivially universal across abstract behavioral properties.
- 9. It remains an open question whether the discovered orthogonal cone axes correspond to semantically interpretable facets of truth (e.g., temporal vs. geographic vs. commonsense facts) or are artifacts of the gradient-based optimization without inherent semantic meaning.
- 10. Models occasionally output non-English equivalents of 'Yes' and 'No' (e.g., 'Sí', 'Nein') following truth-direction interventions when output vocabulary is unrestricted, raising the hypothesis that the identified truth subspace may encode a language-agnostic representation of factuality.
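In symbols, the cone and the three-term loss from the takeaways above can be written as follows. This is a notational sketch: the symbols $v_i$, $\lambda_i$, $k$, and $\mathcal{C}$ are introduced here for illustration, and the unweighted sum mirrors how the loss is stated in takeaway 3; any per-term weights the paper may use are not shown.

```latex
% Concept cone: all nonnegative combinations of k orthonormal basis vectors
\mathcal{C}(v_1,\dots,v_k)
  = \Bigl\{ \textstyle\sum_{i=1}^{k} \lambda_i v_i \;:\; \lambda_i \ge 0 \Bigr\},
\qquad v_i^\top v_j = \delta_{ij}

% Composite training objective
\mathcal{L} = \mathcal{L}_{\mathrm{add}} + \mathcal{L}_{\mathrm{ablate}} + \mathcal{L}_{\mathrm{retain}}
```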
Peer brief — for seminar discussion
Yu et al. extend the concept cone framework, originally introduced by Wollschläger et al. 2025 to characterize refusal geometry, to propositional truth, asking whether truth in LLMs is encoded as a single linear direction or as a richer multi-dimensional subspace. Working with five open-source models (Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Gemma-2-2B, and Gemma-2-9B) and three factual datasets (cities from Marks & Tegmark 2024; element_symb and animals_class from Azaria & Mitchell 2023), they learn orthonormal basis vectors by gradient descent on a composite loss that rewards causal steering of binary Yes/No truth judgments while penalizing drift on Alpaca instruction-following prompts.
The central finding is that Qwen2.5-7B and Gemma-2-9B sustain near-100% Answer Switching Rate across cone dimensionalities 1–5, establishing a truth-mediating subspace of at least 5 dimensions in those models. Cosine similarities between the classical difference-in-means direction and all cone axes beyond the first are on the order of 10⁻⁹, meaning standard linear probing captures only one facet of the underlying geometry. The interventions are also surgical: mean KL divergence on 200 Alpaca prompts ranges from 0.026 (Qwen2.5-7B) to 0.045 (Gemma-2-2B), well under the 0.1 threshold used as a quality filter following Arditi et al. 2024. Truth-mediating directions cluster between 60–75% of normalized layer depth and are strongest at the final token position, consistent with the picture of high-level features accumulating late in the residual stream.
The paper's broader implication is that multiple independently steerable dimensions of factual behavior exist, making models potentially more vulnerable to subtle adversarial manipulation that bypasses the primary truth direction detectable by probing. This amounts to an implicit prediction that single-direction defenses against hallucination or deception will be incomplete.
Sparse autoencoder decomposition of the residual stream is an alternative method: it provides overlapping evidence about multi-dimensional feature geometry but lacks the explicit causal validation through activation steering that concept cones afford. The most contestable aspect is scope: all experiments are confined to simple, unambiguous propositional facts (e.g., 'The Eiffel Tower is in Paris') in models ranging only from 2B to 14B parameters. It is unclear whether the identified 5-dimensional subspace generalizes to larger frontier models, instruction-tuned models trained with RLHF, or more semantically complex truth conditions involving context-dependence, uncertainty, or subjectivity. The paper itself concedes that the individual cone axes have no assigned semantic interpretation: there is no evidence that the orthogonal dimensions correspond to meaningful facets (temporal, geographic, or commonsense facts) rather than being optimization artifacts. The failure to find valid cones for sentiment (Stanford Sentiment Treebank) or toxicity (ToxiGen, 274,000 phrases) is also discussed only in an appendix and is undertheorized: the paper does not explain why truth yields a clean multi-dimensional cone while these other abstract properties do not, which raises the question of whether the success on truth is principled or domain-specific.
Methods (3)
- Answer Switching Rate (ASR): Key evaluation metric: proportion of inputs for which an intervention successfully flips model output
- Loss-Guided Concept Cone Discovery: Optimization procedure that learns orthonormal basis vectors satisfying causal truth and retention constraints via composite loss
- Monte Carlo Cone Sampling: Procedure for sampling 64 random nonnegative combinations of cone basis vectors to evaluate the full cone distribution
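The ASR metric and the 64-sample Monte Carlo evaluation listed above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: a "flip" is taken to mean any change from the unsteered answer, coefficients are drawn uniformly from [0, 1), and `steer_fn` is a hypothetical callback standing in for running the model with a steering direction built from the given coefficients.

```python
import random


def answer_switching_rate(baseline_answers, steered_answers):
    """Fraction of inputs whose answer differs from the unsteered baseline."""
    if len(baseline_answers) != len(steered_answers):
        raise ValueError("answer lists must have the same length")
    if not baseline_answers:
        raise ValueError("need at least one example")
    flipped = sum(b != s for b, s in zip(baseline_answers, steered_answers))
    return flipped / len(baseline_answers)


def mean_asr_over_cone(baseline_answers, steer_fn, n_basis, n_samples=64, seed=0):
    """Average ASR over random nonnegative coefficient vectors.

    `steer_fn(coeffs)` is a hypothetical hook: it should return the model's
    answers when steered along the direction sum_i coeffs[i] * v_i.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_samples):
        coeffs = [rng.random() for _ in range(n_basis)]  # all nonnegative
        steered = steer_fn(coeffs)
        rates.append(answer_switching_rate(baseline_answers, steered))
    return sum(rates) / len(rates)


# Toy usage with a stand-in "model" that flips every answer.
baseline = ["Yes", "No", "Yes", "No"]
flip_all = lambda coeffs: ["No" if a == "Yes" else "Yes" for a in baseline]
print(mean_asr_over_cone(baseline, flip_all, n_basis=5))
```

Averaging over sampled directions, rather than scoring only the basis vectors, is what lets the metric speak to the cone as a whole.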
Frameworks (2)
- Concept Cones: The central framework this paper extends from refusal to propositional truth; identifies multi-dimensional subspaces that causally mediate target behaviors
- Linear Representation Hypothesis: The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
Findings (14)
- In Gemma-2-9B, only the first cone axis (v1) has non-negligible cosine similarity to the DIM direction; all other axes have near-zero similarity (~1e-9)
Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
- With unrestricted vocabulary, models occasionally respond in non-English Yes/No equivalents (e.g., Sí, Nein) after truth-direction interventions
Suggestive evidence for language-independent truth representation in LLMs
- In Qwen-2.5-9B, only v1 has meaningful cosine similarity to DIM direction; all additional basis vectors have cosine similarities ~1e-9
Appendix E replication of DIM alignment finding in Qwen model
- ASR spikes rapidly in all tested models in the 0.60–0.75 normalized layer range before decreasing sharply in final layers
Core layer localization finding from Experiment 1
- Qwen2.5-3B ASR drops from 98.6% at dim 1 to 45.1% at dim 2, recovering partially then declining to 65.3% at dim 5
Smaller models show non-monotonic and diminished ASR with increasing cone dimensionality
- DIM-based ablation direction for toxicity on ToxiGen produced unintelligible output; no valid concept cone found
Negative result from toxicity extension showing difficulty obtaining valid linear directions for toxicity
- Alternative tokenizations Yes/No vs yes/no vs true/false had no significant effect on steering outcomes or ASR
Robustness check on token choice for binary classification
- Concept cone methodology failed to produce a meaningful cone for sentiment on Stanford Sentiment Treebank
Negative result from sentiment extension showing concept cones do not trivially generalize
- Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models
Experiment 1 finding localizing where truth can be causally mediated
- Qwen2.5-14B mean KL divergence on Alpaca prompts after truth-direction ablation is 0.038
Experiment 3 result showing minimal behavioral drift from truth intervention in Qwen 14B
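The DIM alignment findings above compare a difference-in-means vector against each cone basis vector. A minimal sketch of that comparison, using toy activations and hypothetical helper names (the real inputs would be residual-stream activations for true and false statements):

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_in_means(true_acts, false_acts):
    """DIM direction: mean activation on true statements minus mean on false ones."""
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    return [t - f for t, f in zip(mu_true, mu_false)]


def cosine_similarity(a, b):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data: true and false activations differ only along the first coordinate.
true_acts = [[2.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
false_acts = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
dim = difference_in_means(true_acts, false_acts)

basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # stand-ins for v1 and v2
sims = [cosine_similarity(dim, v) for v in basis]
```

In this toy setup the DIM vector aligns with the first basis vector and is orthogonal to the second, which is the qualitative pattern the findings report for v1 versus the remaining axes.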
Claims (8)
- Multiple semantically adjacent truth directions make models more vulnerable to manipulations that shift outputs without obvious signs in the primary truth direction
Safety implication derived from multi-dimensional truth structure finding
- Truthful behavior in LLMs is not confined to a single linear axis; multiple orthogonal directions can independently mediate it
Central interpretive claim of the paper
- The L_retain regularization objective is empirically effective at preserving unrelated model capabilities during cone training
Interpretation of low KL divergence results as validation of the training objective
- Truth may be linearly separable in the model's representation space, but the structure is richer than a single linear axis
Interpretive synthesis of DIM and cone intervention successes
- DIM captures only one facet of the multi-dimensional truth subspace; additional orthogonal structure exists beyond it
Interpretation of Experiment 4 cosine similarity results
- Discovered truth directions are highly specific and do not interfere with general instruction-following behavior
Interpretation of KL divergence retention results
- Representational abstraction of truth may emerge more clearly with model scale
Interpretation of weaker PCA separation and lower ASR in smaller models
- Larger models can support higher-dimensional truth cones than smaller models
Interpretation of ASR degradation patterns by model size across cone dimensions
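The retention claims above rest on a mean KL divergence between the model's output distribution before and after ablation. The sketch below shows one way to compute such a score; it is not the paper's evaluation code. It assumes one next-token logit vector per prompt and the direction KL(baseline ‖ ablated), and the function names are hypothetical. The 0.1 cutoff is the quality-filter threshold mentioned in the peer brief.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; zero-probability terms of p contribute nothing."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_kl(baseline_logits, ablated_logits):
    """Average KL(baseline || ablated) over prompts (one logit vector each)."""
    pairs = list(zip(baseline_logits, ablated_logits))
    return sum(kl_divergence(softmax(b), softmax(a)) for b, a in pairs) / len(pairs)


# Toy usage: two prompts over a 3-token vocabulary, lightly perturbed by ablation.
baseline = [[2.0, 1.0, 0.0], [0.5, 0.5, 3.0]]
ablated = [[2.1, 1.0, 0.0], [0.5, 0.4, 3.0]]
score = mean_kl(baseline, ablated)
passes_filter = score < 0.1  # threshold from the text
```

A score near zero means the ablated model's predictions on unrelated prompts are almost indistinguishable from the original's.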
Hypotheses (3)
- Individual cone basis vectors may correspond to interpretable semantic facets of truth such as temporal facts, geographic facts, or commonsense
Future direction hypothesis for giving semantic meaning to individual axes
- Concept cone truth interventions would generalize to larger frontier models and multimodal settings
Key robustness question raised as future work
- The underlying truth representation may generalize across lexical choices and languages
Suggested by non-English Yes/No outputs post-intervention, requiring further investigation
Questions (5)
- Does the multi-directional nature of truth imply an underlying nonlinear representation, or is it compatible with linear separability?
Theoretical open question about the geometry of truth in LLMs raised in Discussion
- Are the discovered truth directions robust to architectural variation and fine-tuning differences across model families?
Open question on generalization beyond Gemma and Qwen families
- How can we discover a maximally informative or interpretable truth subspace rather than just a sufficient one?
Limitation-driven open question about subspace optimality
- Does sentiment have a higher-dimensional concept cone representation, and if so, what methods could find it?
Open question raised by failed sentiment cone extension
- What semantic labels correspond to the individual basis vectors of the truth cone?
Central open question for future work on interpretability of cone axes
Original abstract
Large Language Models (LLMs) exhibit strong conversational abilities but often generate falsehoods. Prior work suggests that the truthfulness of simple propositions can be represented as a single linear direction in a model's internal activations, but this may not fully capture its underlying geometry. In this work, we extend the concept cone framework, recently introduced for modeling refusal, to the domain of truth. We identify multi-dimensional cones that causally mediate truth-related behavior across multiple LLM families. Our results are supported by three lines of evidence: (i) causal interventions reliably flip model responses to factual statements, (ii) learned cones generalize across model architectures, and (iii) cone-based interventions preserve unrelated model behavior. These findings reveal the richer, multidirectional structure governing simple true/false propositions in LLMs and highlight concept cones as a promising tool for probing abstract behaviors.
Related work (refs + corpus + external arXiv)
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. [cited, in corpus] 2023, ≈ 85%
- Testing the Limits of Truth Directions in LLMs. [in corpus] 2026, ≈ 86%
- SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models. Raffaele Mura, Fabio Brau, Luca Oneto, Fabio Roli, Battista Biggio, Giorgio Piras. 2026, ≈ 83%
- When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models. [in corpus] 2025, ≈ 82%
- Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement. Lin Xu, Yang Sun, Wenjun Li, Jie Shi, Yuxiao Lu. 2026, ≈ 82%
- Psychological Steering of Large Language Models. [in corpus] 2026, ≈ 81%
- Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories. Xianfeng Jiao, Yinghao Zhu, Zhongzhi Chen, Yifan He, Xu Chu, Junyi Gao, Yasha Wang, Liantao Ma, Tianlong Wang. 2025, ≈ 81%
- Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions. Fernanda Viégas, Martin Wattenberg, Andrew Lee. 2026, ≈ 81%
- A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning. Nishanth Dikkala, Enming Luo, Cyrus Rashtchian, Xin Wang, Rina Panigrahy, Guan Zhe Hong. 2025, ≈ 81%
- Causal Probing for Internal Visual Representations in Multimodal Large Language Models. Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang, Zehao Deng. 2026, ≈ 81%
- Constructing Interpretable Features from Compositional Neuron Groups. Atticus Geiger, Mor Geva, Or Shafran. 2026, ≈ 81%
- Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models. Mingyu Kang, Yong Suk Choi, Keuntae Kim. 2026, ≈ 80%
- Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation. Zekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, Xiachong Feng, Qiming Li. 2025, ≈ 80%
- Patches of Nonlinearity: Instruction Vectors in Large Language Models. Jonas Rohweder, Subhabrata Dutta, Iryna Gurevych, Irina Bigoulaeva. 2026, ≈ 80%
- The Logical Implication Steering Method for Conditional Interventions on Transformer Generation. Damjan Kalajdzievski. 2025, ≈ 80%
- Can LLMs Lie? Investigation beyond Hallucination. Mihir Prabhudesai, Mengning Wu, Shantanu Jaiswal, Deepak Pathak, Haoran Huan. 2025, ≈ 80%
- Identifying and Transferring Reasoning-Critical Neurons: Improving LLM Inference Reliability via Activation Steering. Zuming Yan, Xuri Ge, Zhiwei Xu, Mengqi Zhang, Xuanang Chen, Ben He, Xin Xin, Zhumin Chen, Ying Zhou, Fangan Dong. 2026, ≈ 80%
- Persuasion Should be Double-Blind: A Multi-Domain Dialogue Dataset With Faithfulness Based on Causal Theory of Mind. Dingyi Zhang and Deyu Zhou. 2025, ≈ 80%
- Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models. Samuel Lewis-Lim, Nikolaos Aletras, Desmond Elliott, Danae Sánchez Villegas. 2026, ≈ 80%
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? [in corpus] 2025, ≈ 79%
- The Platonic Representation Hypothesis. [in corpus] 2024, ≈ 79%
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations. [in corpus] 2023, ≈ 79%
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior. [in corpus] 2026, ≈ 79%
- Zoom In: An Introduction to Circuits. [cited] 2020, ≈ 71%