From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

By Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Clayton Lau · Cole Blondin · Sean O’Brien +2 more

DOI 10.48550/arxiv.2505.21800 arXiv 2505.21800 OpenAlex W4416046431

Cross-Lingual Truth Representation · Concept Cones · Answer Switching Rate (ASR) · Mechanistic Interpretability · Linear Representation Hypothesis · Loss-Guided Concept Cone Discovery · Propositional Truth · Monte Carlo Cone Sampling · Superposition

TL;DR

Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of orthonormal basis vectors, each of which independently causally mediates true/false behavior. Applying the gradient-based concept cone framework (introduced by Wollschläger et al. 2025 for refusal) to truth, experiments across Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Gemma-2-2B, and Gemma-2-9B show that Qwen2.5-7B and Gemma-2-9B sustain near-100% Answer Switching Rate (ASR) across all tested cone dimensionalities from 1 to 5, confirming at least a 5-dimensional truth-mediating subspace in those models. Directional ablation using discovered cone vectors on 200 Alpaca prompts yields mean KL divergences of only 0.026–0.045 across models, confirming surgical specificity. Cosine similarities between the classic difference-in-means (DIM) truth vector and all cone basis vectors beyond the first are on the order of 10⁻⁹, establishing that the additional axes are genuinely orthogonal to DIM rather than refinements of it. Truth-related directions reliably emerge between 60–75% of normalized layer depth, peaking at the final token position. These findings imply that models may be more vulnerable to adversarial manipulation of truthfulness than single-direction accounts suggest, because multiple independently steerable dimensions of factual behavior exist and can be exploited without disturbing the primary direction detectable by standard probing.

What to take away

1. Qwen2.5-7B and Gemma-2-9B maintain near-100% Answer Switching Rate (ASR) across cone dimensionalities 1 through 5, demonstrating that at least a 5-dimensional concept cone causally mediates propositional truth in those models.
2. Truth-mediating directions reliably emerge between 60–75% of normalized layer depth across all tested Qwen2.5 and Gemma-2 variants, peaking at the final token position, consistent with prior findings on high-level decision accumulation.
3. The concept cone framework is operationalized with a three-term loss (L_add + L_ablate + L_retain), where L_retain is measured on 30-token continuations of Alpaca instructions to guard against collateral behavioral drift.
4. Directional ablation of discovered truth cones on 200 Alpaca prompts yields mean KL divergences of 0.038, 0.045, 0.026, and 0.031 for Qwen2.5-14B, Gemma-2-2B, Qwen2.5-7B, and Gemma-2-9B respectively, indicating minimal interference with general instruction-following.
5. Cosine similarities between the difference-in-means (DIM) truth vector and cone basis vectors v2 through v5 in Gemma-2-9B are on the order of 10⁻⁹, confirming these axes encode orthogonal structure absent from the classical linear direction.
6. Smaller models show non-monotonic ASR with increasing cone dimensionality: Gemma-2-2B drops to 53.7% at dim-3 and 27.1% at dim-5, while Qwen2.5-3B drops to 45.1% at dim-2 before partially recovering, suggesting representational capacity limits truth subspace dimensionality.
7. The methodology for cone discovery uses a gradient-based optimization over an orthonormal basis with binary cross-entropy targets (restricting output logits to 'Yes'/'No' tokens) and Monte Carlo sampling of 64 random nonnegative-coefficient directions per cone for evaluation.
8. Applying the same concept cone framework to sentiment (Stanford Sentiment Treebank) and toxicity (ToxiGen, 274,000 phrases) failed to yield valid cones, suggesting the method's success on truth is not trivially universal across abstract behavioral properties.
9. It remains an open question whether the discovered orthogonal cone axes correspond to semantically interpretable facets of truth (e.g., temporal vs. geographic vs. commonsense facts) or are artifacts of the gradient-based optimization without inherent semantic meaning.
10. Models occasionally output non-English equivalents of 'Yes' and 'No' (e.g., 'Sí', 'Nein') following truth-direction interventions when output vocabulary is unrestricted, raising the hypothesis that the identified truth subspace may encode a language-agnostic representation of factuality.
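
For reference, the cone definition from the TL;DR and the objective named in item 3 can be written compactly. The notation is chosen here for illustration (k for the cone dimensionality, λ_i for the coefficients); the summary gives the loss only as the sum of three named terms, so any per-term weighting is left out rather than guessed.

```latex
% Concept cone spanned by an orthonormal basis v_1, ..., v_k
\mathcal{C}(v_1,\dots,v_k)
  = \Bigl\{ \textstyle\sum_{i=1}^{k} \lambda_i v_i \;:\; \lambda_i \ge 0 \Bigr\},
\qquad v_i^\top v_j = \delta_{ij}

% Three-term training objective as stated in the summary
\mathcal{L} = \mathcal{L}_{\text{add}} + \mathcal{L}_{\text{ablate}} + \mathcal{L}_{\text{retain}}
```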

Peer brief — for seminar discussion

Yu et al. extend the concept cone framework—originally introduced by Wollschläger et al. 2025 for characterizing refusal geometry—to the domain of propositional truth, asking whether truth in LLMs is encoded as a single linear direction or as a richer multi-dimensional subspace. Working with five open-source models (Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Gemma-2-2B, and Gemma-2-9B) and three factual datasets (cities from Marks & Tegmark 2024, element_symb and animals_class from Azaria & Mitchell 2023), they learn orthonormal basis vectors via gradient descent over a composite loss that rewards causal steering of binary Yes/No truth judgments while penalizing drift on Alpaca instruction-following prompts. The central finding is that Qwen2.5-7B and Gemma-2-9B sustain near-100% Answer Switching Rate across cone dimensionalities 1–5, establishing a genuinely 5-dimensional truth-mediating subspace, while cosine similarities between the classical difference-in-means direction and all cone axes beyond the first are on the order of 10⁻⁹—meaning standard linear probing captures only one facet of the underlying geometry. The interventions are also remarkably surgical: mean KL divergence on 200 Alpaca prompts ranges from 0.026 (Qwen2.5-7B) to 0.045 (Gemma-2-2B), well under the 0.1 threshold used as a quality filter following Arditi et al. 2024. Truth-mediating directions cluster between 60–75% of normalized layer depth and are strongest at the final token position, consistent with the picture of high-level features accumulating late in the residual stream. The paper's broader implication is that multiple independently steerable dimensions of factual behavior exist, making models potentially more vulnerable to subtle adversarial manipulation that bypasses the primary truth direction detectable by probing; this constitutes an implicit prediction that single-direction defenses against hallucination or deception will be incomplete. 
An alternative method that could have been used is sparse autoencoder decomposition of the residual stream, which provides overlapping evidence about multi-dimensional feature geometry but lacks the explicit causal validation through activation steering that concept cones afford. The most contestable aspect is scope: all experiments are confined to simple, unambiguous propositional facts (e.g., 'The Eiffel Tower is in Paris') in models ranging only from 2B to 14B parameters. It is entirely unclear whether the identified 5-dimensional subspace generalizes to larger frontier models, instruction-tuned models trained with RLHF, or more semantically complex truth conditions involving context-dependence, uncertainty, or subjectivity. Critically, the paper itself concedes that the individual cone axes have no assigned semantic interpretation—there is no evidence that the orthogonal dimensions correspond to meaningful facets like temporal versus geographic facts versus commonsense, rather than being optimization artifacts. A critical reader would also note that the failure to find valid cones for sentiment (Stanford Sentiment Treebank) or toxicity (ToxiGen, 274,000 phrases) is discussed only in an appendix and is undertheorized: it is not explained why truth yields a clean multi-dimensional cone while these other abstract properties do not, which raises questions about whether the success on truth is principled or domain-specific.
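
The retention check discussed above compares next-token distributions with and without the ablation. The following is a minimal sketch of that kind of computation, not the authors' code: the function names, the toy logits, and the choice of KL(baseline || ablated) as the direction of the divergence are all assumptions made for illustration. Only the 0.1 threshold and the 200-prompt count come from the summary.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(base_logits, ablated_logits):
    """Mean KL(base || ablated) over all leading positions (assumed direction)."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(ablated_logits)
    kl = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl.mean())

# Toy stand-in for model outputs: 200 hypothetical prompts, vocabulary of 50.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 50))
ablated = base + 0.1 * rng.normal(size=base.shape)  # small synthetic drift

kl_value = mean_kl(base, ablated)
passes_filter = kl_value < 0.1  # quality threshold quoted in the brief
```

In a real pipeline the two logit arrays would come from running the same prompts through the unmodified and the ablated model; here they are random placeholders.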

Methods (3)

Answer Switching Rate (ASR)
Key evaluation metric: proportion of inputs for which an intervention successfully flips model output
Loss-Guided Concept Cone Discovery
Optimization procedure that learns orthonormal basis vectors satisfying causal truth and retention constraints via composite loss
Monte Carlo Cone Sampling
Procedure for sampling 64 random nonnegative combinations of cone basis vectors to evaluate the full cone distribution
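
The two evaluation procedures listed above can be sketched as follows. This is an illustrative sketch under stated assumptions: the summary specifies 64 random nonnegative-coefficient directions per cone but not the coefficient distribution, so uniform coefficients are assumed here, and the function names and toy basis are hypothetical.

```python
import numpy as np

def sample_cone_directions(basis, n_samples=64, seed=0):
    """Draw unit vectors that are nonnegative combinations of the basis rows.

    basis: (k, d) array whose rows are orthonormal.
    Coefficients are uniform on [0, 1), which is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    coeffs = rng.random((n_samples, basis.shape[0]))
    directions = coeffs @ basis
    return directions / np.linalg.norm(directions, axis=1, keepdims=True)

def answer_switching_rate(before, after):
    """Fraction of inputs whose answer differs after the intervention."""
    before = np.asarray(before)
    after = np.asarray(after)
    return float((before != after).mean())

# Toy orthonormal basis: 5 directions in a 16-dimensional space via QR.
rng = np.random.default_rng(1)
q, _ = np.linalg.qr(rng.normal(size=(16, 5)))
basis = q.T

dirs = sample_cone_directions(basis)

# Three of the four toy answers flip, so the rate is 0.75.
asr = answer_switching_rate(["Yes", "No", "Yes", "No"],
                            ["No", "Yes", "Yes", "Yes"])
```

Each sampled direction would then be used for steering, and ASR would be computed on the model's actual answers rather than the hand-written lists used here.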

Frameworks (2)

Concept Cones
The central framework this paper extends from refusal to propositional truth; identifies multi-dimensional subspaces that causally mediate target behaviors
Linear Representation Hypothesis
The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
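
If a concept is approximately a linear direction, it can be estimated and intervened on with plain vector arithmetic. The sketch below shows a difference-in-means estimate, directional ablation, and activation addition on synthetic data. It follows the standard formulations from the steering literature rather than the paper's code; the function names, the steering strength `alpha`, and the toy activations are assumptions.

```python
import numpy as np

def difference_in_means(true_acts, false_acts):
    """Unit vector pointing from the mean false activation to the mean true one."""
    diff = true_acts.mean(axis=0) - false_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(acts, direction):
    """Remove the component of each activation along `direction`."""
    r = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ r, r)

def add_direction(acts, direction, alpha):
    """Shift each activation by `alpha` along the unit-normalized direction."""
    r = direction / np.linalg.norm(direction)
    return acts + alpha * r

# Synthetic activations: the two classes differ mainly along coordinate 0.
rng = np.random.default_rng(0)
offset = np.zeros(8)
offset[0] = 2.0
true_acts = rng.normal(size=(100, 8)) + offset
false_acts = rng.normal(size=(100, 8)) - offset

dim_dir = difference_in_means(true_acts, false_acts)
ablated = ablate(true_acts, dim_dir)                      # projection removed
steered = add_direction(false_acts, dim_dir, alpha=4.0)   # pushed toward "true"
```

The same two operations apply unchanged to any direction sampled from a cone, since both only require a single vector.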

Findings (14)

In Gemma-2-9B, only the first cone axis (v1) has non-negligible cosine similarity to the DIM direction; all other axes have near-zero similarity (~1e-9)
Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
With unrestricted vocabulary, models occasionally respond in non-English Yes/No equivalents (e.g., Sí, Nein) after truth-direction interventions
Suggestive evidence for language-independent truth representation in LLMs
In the Qwen2.5 model analyzed in Appendix E, only v1 has meaningful cosine similarity to the DIM direction; all additional basis vectors have cosine similarities ~1e-9
Appendix E replication of DIM alignment finding in Qwen model
ASR spikes rapidly in all tested models in the 0.60–0.75 normalized layer range before decreasing sharply in final layers
Core layer localization finding from Experiment 1
Qwen2.5-3B ASR drops from 98.6% at dim 1 to 45.1% at dim 2, recovering partially, then declining to 65.3% at dim 5
Smaller models show non-monotonic and diminished ASR with increasing cone dimensionality
DIM-based ablation direction for toxicity on ToxiGen produced unintelligible output; no valid concept cone found
Negative result from toxicity extension showing difficulty obtaining valid linear directions for toxicity
Alternative tokenizations Yes/No vs yes/no vs true/false had no significant effect on steering outcomes or ASR
Robustness check on token choice for binary classification
Concept cone methodology failed to produce a meaningful cone for sentiment on Stanford Sentiment Treebank
Negative result from sentiment extension showing concept cones do not trivially generalize
Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models
Experiment 1 finding localizing where truth can be causally mediated
Qwen2.5-14B mean KL divergence on Alpaca prompts after truth-direction ablation is 0.038
Experiment 3 result showing minimal behavioral drift from truth intervention in Qwen2.5-14B
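
Two of the measurements reported above, cosine similarity against the DIM direction and normalized layer depth, are simple to compute. The sketch below uses toy vectors that are orthogonal by construction, solely to exercise the function; it does not reproduce the paper's numbers. The depth convention i / (n_layers - 1) and the 40-layer example are assumptions, since the summary does not define the normalization.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def layers_in_depth_range(n_layers, lo=0.60, hi=0.75):
    """Layer indices whose normalized depth i / (n_layers - 1) lies in [lo, hi]."""
    return [i for i in range(n_layers) if lo <= i / (n_layers - 1) <= hi]

# Toy setup: a random "DIM" vector and an orthonormal basis whose first
# column is parallel to it (QR keeps the first column's direction up to sign).
rng = np.random.default_rng(0)
dim_vec = rng.normal(size=8)
q, _ = np.linalg.qr(np.column_stack([dim_vec, rng.normal(size=(8, 4))]))

cosines = [cosine(dim_vec, q[:, j]) for j in range(q.shape[1])]

# Hypothetical 40-layer model: layers 24 through 29 fall in the 60-75% band.
band = layers_in_depth_range(40)
```

With real activations, `dim_vec` would be the difference-in-means direction and the columns of `q` would be replaced by the learned cone basis vectors.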

Claims (8)

Multiple semantically adjacent truth directions make models more vulnerable to manipulations that shift outputs without obvious signs in the primary truth direction
Safety implication derived from multi-dimensional truth structure finding
Truthful behavior in LLMs is not confined to a single linear axis; multiple orthogonal directions can independently mediate it
Central interpretive claim of the paper
The L_retain regularization objective is empirically effective at preserving unrelated model capabilities during cone training
Interpretation of low KL divergence results as validation of the training objective
Truth may be linearly separable in the model's representation space, but the structure is richer than a single linear axis
Interpretive synthesis of DIM and cone intervention successes
DIM captures only one facet of the multi-dimensional truth subspace; additional orthogonal structure exists beyond it
Interpretation of Experiment 4 cosine similarity results
Discovered truth directions are highly specific and do not interfere with general instruction-following behavior
Interpretation of KL divergence retention results
Representational abstraction of truth may emerge more clearly with model scale
Interpretation of weaker PCA separation and lower ASR in smaller models
Larger models can support higher-dimensional truth cones than smaller models
Interpretation of ASR degradation patterns by model size across cone dimensions

Hypotheses (3)

Individual cone basis vectors may correspond to interpretable semantic facets of truth such as temporal facts, geographic facts, or commonsense
Future direction hypothesis for giving semantic meaning to individual axes
Concept cone truth interventions would generalize to larger frontier models and multimodal settings
Key robustness question raised as future work
The underlying truth representation may generalize across lexical choices and languages
Suggested by non-English Yes/No outputs post-intervention, requiring further investigation

Questions (5)

Does the multi-directional nature of truth imply an underlying nonlinear representation, or is it compatible with linear separability?
Theoretical open question about the geometry of truth in LLMs raised in Discussion
Are the discovered truth directions robust to architectural variation and fine-tuning differences across model families?
Open question on generalization beyond Gemma and Qwen families
How can we discover a maximally informative or interpretable truth subspace rather than just a sufficient one?
Limitation-driven open question about subspace optimality
Does sentiment have a higher-dimensional concept cone representation, and if so, what methods could find it?
Open question raised by failed sentiment cone extension
What semantic labels correspond to the individual basis vectors of the truth cone?
Central open question for future work on interpretability of cone axes

Original abstract

Large Language Models (LLMs) exhibit strong conversational abilities but often generate falsehoods. Prior work suggests that the truthfulness of simple propositions can be represented as a single linear direction in a model's internal activations, but this may not fully capture its underlying geometry. In this work, we extend the concept cone framework, recently introduced for modeling refusal, to the domain of truth. We identify multi-dimensional cones that causally mediate truth-related behavior across multiple LLM families. Our results are supported by three lines of evidence: (i) causal interventions reliably flip model responses to factual statements, (ii) learned cones generalize across model architectures, and (iii) cone-based interventions preserve unrelated model behavior. These findings reveal the richer, multidirectional structure governing simple true/false propositions in LLMs and highlight concept cones as a promising tool for probing abstract behaviors.

Related work— refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
cited
in corpus
2023
≈ 85%
Testing the Limits of Truth Directions in LLMs
in corpus
2026
≈ 86%
SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models
Giorgio Piras, Raffaele Mura, Fabio Brau, Luca Oneto, Fabio Roli, Battista Biggio
2026
≈ 83%
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
in corpus
2025
≈ 82%
Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement
Yuxiao Lu, Lin Xu, Yang Sun, Wenjun Li, Jie Shi
2026
≈ 82%
Psychological Steering of Large Language Models
in corpus
2026
≈ 81%
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang, Xianfeng Jiao, Yinghao Zhu, Zhongzhi Chen, Yifan He, Xu Chu, Junyi Gao, Yasha Wang, Liantao Ma
2025
≈ 81%
Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions
Andrew Lee, Fernanda Viégas, Martin Wattenberg
2026
≈ 81%
A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning
Guan Zhe Hong, Nishanth Dikkala, Enming Luo, Cyrus Rashtchian, Xin Wang, Rina Panigrahy
2025
≈ 81%
Causal Probing for Internal Visual Representations in Multimodal Large Language Models
Zehao Deng, Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang
2026
≈ 81%
Constructing Interpretable Features from Compositional Neuron Groups
Or Shafran, Atticus Geiger, Mor Geva
2026
≈ 81%
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
Keuntae Kim, Mingyu Kang, Yong Suk Choi
2026
≈ 80%
Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation
Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, Xiachong Feng
2025
≈ 80%
Patches of Nonlinearity: Instruction Vectors in Large Language Models
Irina Bigoulaeva, Jonas Rohweder, Subhabrata Dutta, Iryna Gurevych
2026
≈ 80%
The Logical Implication Steering Method for Conditional Interventions on Transformer Generation
Damjan Kalajdzievski
2025
≈ 80%
Can LLMs Lie? Investigation beyond Hallucination
Haoran Huan, Mihir Prabhudesai, Mengning Wu, Shantanu Jaiswal, Deepak Pathak
2025
≈ 80%
Identifying and Transferring Reasoning-Critical Neurons: Improving LLM Inference Reliability via Activation Steering
Fangan Dong, Zuming Yan, Xuri Ge, Zhiwei Xu, Mengqi Zhang, Xuanang Chen, Ben He, Xin Xin, Zhumin Chen, Ying Zhou
2026
≈ 80%
ReflCtrl: Controlling LLM Reflection via Representation Engineering
in corpus
2025
≈ 80%
Persuasion Should be Double-Blind: A Multi-Domain Dialogue Dataset With Faithfulness Based on Causal Theory of Mind
Dingyi Zhang and Deyu Zhou
2025
≈ 80%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 80%
Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models
Danae Sánchez Villegas, Samuel Lewis-Lim, Nikolaos Aletras, Desmond Elliott
2026
≈ 80%
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
in corpus
2025
≈ 79%
The Platonic Representation Hypothesis
in corpus
2024
≈ 79%
Multimodal Chain-of-Thought Reasoning in Language Models
in corpus
2023
≈ 79%
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
in corpus
2026
≈ 79%
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
in corpus
2023
≈ 79%
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
in corpus
2026
≈ 79%
Steering language models with activation engineering
cited
2023
≈ 74%
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
cited
2019
≈ 71%
Zoom In: An Introduction to Circuits
cited
2020
≈ 71%
