2025
paper:doi-10-48550-arxiv-2505-21800

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

TL;DR

Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones: sets of all nonnegative linear combinations of orthonormal basis vectors, each of which independently causally mediates true/false behavior. Applying the gradient-based concept cone framework (introduced by Wollschläger et al. 2025 for refusal) to truth, experiments across Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Gemma-2-2B, and Gemma-2-9B show that Qwen2.5-7B and Gemma-2-9B sustain near-100% Answer Switching Rate (ASR) across all tested cone dimensionalities from 1 to 5, indicating a truth-mediating subspace of at least 5 dimensions in those models. Directional ablation using discovered cone vectors on 200 Alpaca prompts yields mean KL divergences of only 0.026–0.045 across models, indicating that the intervention leaves general instruction-following largely intact. Cosine similarities between the classic difference-in-means (DIM) truth vector and all cone basis vectors beyond the first are on the order of 10⁻⁹ (reported for Gemma-2-9B), so the additional axes are orthogonal to DIM rather than refinements of it. Truth-related directions reliably emerge between 60–75% of normalized layer depth, peaking at the final token position. These findings imply that models may be more vulnerable to adversarial manipulation of truthfulness than single-direction accounts suggest, because multiple independently steerable dimensions of factual behavior exist and can be exploited without disturbing the primary direction detectable by standard probing.
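
The orthogonality claim can be illustrated with a minimal sketch. It assumes toy Gaussian activations in place of real residual-stream activations, and it swaps in Gram-Schmidt for the paper's gradient-based optimization, with the first basis vector set parallel to DIM (an assumption made for illustration only). All function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def mean_vector(rows):
    # Column-wise mean of a list of equal-length vectors.
    return [sum(col) / len(rows) for col in zip(*rows)]

def difference_in_means(true_acts, false_acts):
    # Classic DIM direction: mean(true activations) - mean(false activations).
    mu_t, mu_f = mean_vector(true_acts), mean_vector(false_acts)
    return [t - f for t, f in zip(mu_t, mu_f)]

def gram_schmidt(vectors):
    # Orthonormalize vectors in order (stand-in for the learned basis, NOT the paper's method).
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            proj = dot(w, b)
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        n = norm(w)
        basis.append([wi / n for wi in w])
    return basis

rng = random.Random(0)
d = 16  # toy hidden size (assumption)
# Toy activations: "true" and "false" statements differ along coordinate 0.
true_acts = [[rng.gauss(0, 1) + (1.0 if i == 0 else 0.0) for i in range(d)] for _ in range(50)]
false_acts = [[rng.gauss(0, 1) - (1.0 if i == 0 else 0.0) for i in range(d)] for _ in range(50)]

dim_vec = difference_in_means(true_acts, false_acts)
extra = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(4)]
basis = gram_schmidt([dim_vec] + extra)  # v1 parallel to DIM, v2..v5 orthogonal to it

cosines = [cosine(dim_vec, b) for b in basis]
print(["%.1e" % c for c in cosines])
```

In this toy setup the first cosine is 1 up to rounding and the rest are zero up to floating-point error, which is the pattern the summary describes for v2 through v5. Whether the paper's first learned axis actually coincides with DIM is not stated here.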

What to take away

  1. Qwen2.5-7B and Gemma-2-9B maintain near-100% Answer Switching Rate (ASR) across cone dimensionalities 1 through 5, demonstrating that at least a 5-dimensional concept cone causally mediates propositional truth in those models.
  2. Truth-mediating directions reliably emerge between 60–75% of normalized layer depth across all tested Qwen2.5 and Gemma-2 variants, peaking at the final token position, consistent with prior findings on high-level decision accumulation.
  3. The concept cone framework is operationalized with a three-term loss (L_add + L_ablate + L_retain), where L_retain is measured on 30-token continuations of Alpaca instructions to guard against collateral behavioral drift.
  4. Directional ablation of discovered truth cones on 200 Alpaca prompts yields mean KL divergences of 0.038, 0.045, 0.026, and 0.031 for Qwen2.5-14B, Gemma-2-2B, Qwen2.5-7B, and Gemma-2-9B respectively, indicating minimal interference with general instruction-following.
  5. Cosine similarities between the difference-in-means (DIM) truth vector and cone basis vectors v2 through v5 in Gemma-2-9B are on the order of 10⁻⁹, confirming these axes encode orthogonal structure absent from the classical linear direction.
  6. Smaller models show non-monotonic ASR with increasing cone dimensionality: Gemma-2-2B drops to 53.7% at dim-3 and 27.1% at dim-5, while Qwen2.5-3B drops to 45.1% at dim-2 before partially recovering, suggesting representational capacity limits truth subspace dimensionality.
  7. The methodology for cone discovery uses a gradient-based optimization over an orthonormal basis with binary cross-entropy targets (restricting output logits to 'Yes'/'No' tokens) and Monte Carlo sampling of 64 random nonnegative-coefficient directions per cone for evaluation.
  8. Applying the same concept cone framework to sentiment (Stanford Sentiment Treebank) and toxicity (ToxiGen, 274,000 phrases) failed to yield valid cones, suggesting the method's success on truth is not trivially universal across abstract behavioral properties.
  9. It remains an open question whether the discovered orthogonal cone axes correspond to semantically interpretable facets of truth (e.g., temporal vs. geographic vs. commonsense facts) or are artifacts of the gradient-based optimization without inherent semantic meaning.
  10. Models occasionally output non-English equivalents of 'Yes' and 'No' (e.g., 'Sí', 'Nein') following truth-direction interventions when output vocabulary is unrestricted, raising the hypothesis that the identified truth subspace may encode a language-agnostic representation of factuality.
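
The directional-ablation check in the list above can be sketched on toy data. Ablation here is the standard projection x − (x·v̂)v̂. The single linear readout `unembed`, the toy activations, and the KL direction (original relative to ablated) are assumptions for illustration. In a real model the ablation would be applied inside the network and the KL averaged over prompts such as the 200 Alpaca instructions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ablate_direction(x, v):
    # Remove the component of x along v: x - (x . v_hat) v_hat.
    norm_v = math.sqrt(dot(v, v))
    v_hat = [vi / norm_v for vi in v]
    coef = dot(x, v_hat)
    return [xi - coef * vi for xi, vi in zip(x, v_hat)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    # KL(p || q) for discrete distributions with full support in q.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def next_token_probs(x, unembed):
    # Toy stand-in for the model head: one linear map followed by softmax.
    return softmax([dot(row, x) for row in unembed])

# Hypothetical toy values: 4-token vocabulary, 3-dimensional hidden state.
unembed = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [0.5, 0.5, 0.0], [-1.0, 0.2, 0.3]]
direction = [0.0, 0.0, 2.0]  # direction to ablate (need not be unit length)
activations = [[0.3, -0.1, 0.8], [1.2, 0.4, -0.5], [-0.7, 0.9, 0.1]]

kls = []
for x in activations:
    p = next_token_probs(x, unembed)
    q = next_token_probs(ablate_direction(x, direction), unembed)
    kls.append(kl_divergence(p, q))

mean_kl = sum(kls) / len(kls)
print("mean KL:", mean_kl)
```

A small mean KL on unrelated prompts is what the summary reads as evidence that the ablated direction is specific to truth judgments rather than to general behavior.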

Peer brief — for seminar discussion

Yu et al. extend the concept cone framework, originally introduced by Wollschläger et al. 2025 for characterizing refusal geometry, to the domain of propositional truth, asking whether truth in LLMs is encoded as a single linear direction or as a richer multi-dimensional subspace. Working with five open-source models (Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Gemma-2-2B, and Gemma-2-9B) and three factual datasets (cities from Marks & Tegmark 2024, element_symb and animals_class from Azaria & Mitchell 2023), they learn orthonormal basis vectors via gradient descent over a composite loss that rewards causal steering of binary Yes/No truth judgments while penalizing drift on Alpaca instruction-following prompts. The central finding is that Qwen2.5-7B and Gemma-2-9B sustain near-100% Answer Switching Rate across cone dimensionalities 1–5, indicating a truth-mediating subspace of at least 5 dimensions, while cosine similarities between the classical difference-in-means direction and all cone axes beyond the first are on the order of 10⁻⁹, meaning standard linear probing captures only one facet of the underlying geometry. The interventions are also highly targeted: mean KL divergence on 200 Alpaca prompts ranges from 0.026 (Qwen2.5-7B) to 0.045 (Gemma-2-2B), well under the 0.1 threshold used as a quality filter following Arditi et al. 2024. Truth-mediating directions cluster between 60–75% of normalized layer depth and are strongest at the final token position, consistent with the picture of high-level features accumulating late in the residual stream. The paper's broader implication is that multiple independently steerable dimensions of factual behavior exist, making models potentially more vulnerable to subtle adversarial manipulation that bypasses the primary truth direction detectable by probing; this amounts to an implicit prediction that single-direction defenses against hallucination or deception will be incomplete.
An alternative method that could have been used is sparse autoencoder decomposition of the residual stream, which provides overlapping evidence about multi-dimensional feature geometry but lacks the explicit causal validation through activation steering that concept cones afford. The most contestable aspect is scope: all experiments are confined to simple, unambiguous propositional facts (e.g., 'The Eiffel Tower is in Paris') in models ranging only from 2B to 14B parameters. It remains unclear whether the identified 5-dimensional subspace generalizes to larger frontier models, instruction-tuned models trained with RLHF, or more semantically complex truth conditions involving context-dependence, uncertainty, or subjectivity. The paper itself concedes that the individual cone axes have no assigned semantic interpretation: there is no evidence that the orthogonal dimensions correspond to meaningful facets such as temporal, geographic, or commonsense facts, rather than being optimization artifacts. The failure to find valid cones for sentiment (Stanford Sentiment Treebank) or toxicity (ToxiGen, 274,000 phrases) is also discussed only in an appendix and is undertheorized: the paper does not explain why truth yields a clean multi-dimensional cone while these other abstract properties do not, which raises the question of whether the success on truth is principled or domain-specific.
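
To make the 60–75% depth band concrete, the following sketch converts it into layer indices. It assumes normalized depth means layer index divided by total layer count, and it uses a hypothetical 40-layer model. Neither the convention nor the layer count is taken from the paper.

```python
def depth_band(num_layers, lo_pct=60, hi_pct=75):
    # Layer indices whose normalized depth (index / num_layers) lies in [lo_pct%, hi_pct%].
    # Integer arithmetic avoids floating-point rounding at the band edges.
    first = -(-lo_pct * num_layers // 100)  # ceiling division
    last = hi_pct * num_layers // 100       # floor division
    return list(range(first, last + 1))

# Hypothetical 40-layer model (assumption, not one of the models studied).
print(depth_band(40))
```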

Methods (3)

  • Answer Switching Rate (ASR)
    Key evaluation metric: proportion of inputs for which an intervention successfully flips model output
  • Loss-Guided Concept Cone Discovery
    Optimization procedure that learns orthonormal basis vectors satisfying causal truth and retention constraints via composite loss
  • Monte Carlo Cone Sampling
    Procedure for sampling 64 random nonnegative combinations of cone basis vectors to evaluate the full cone distribution
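
The two evaluation pieces above, Monte Carlo cone sampling and ASR, can be sketched together. The coefficient distribution (absolute values of standard normal draws), the unit-normalization step, and the toy Yes/No answers are assumptions. The summary specifies only that 64 random nonnegative combinations are sampled per cone and that ASR is the proportion of flipped outputs.

```python
import math
import random

def sample_cone_direction(basis, rng):
    # Draw nonnegative coefficients and return the normalized combination of basis vectors.
    coeffs = [abs(rng.gauss(0, 1)) for _ in basis]
    dim = len(basis[0])
    v = [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(dim)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def answer_switching_rate(before, after):
    # Fraction of inputs whose answer changed after the intervention.
    flipped = sum(1 for b, a in zip(before, after) if b != a)
    return flipped / len(before)

rng = random.Random(0)
hidden, cone_dim = 8, 5  # toy sizes (assumption)
# Toy orthonormal basis: the first five standard basis vectors.
basis = [[1.0 if i == k else 0.0 for i in range(hidden)] for k in range(cone_dim)]

samples = [sample_cone_direction(basis, rng) for _ in range(64)]

# Hypothetical model answers before and after steering with one sampled direction.
before = ["Yes", "No", "Yes", "No"]
after = ["No", "Yes", "Yes", "Yes"]
asr = answer_switching_rate(before, after)
print(len(samples), asr)
```

In a full evaluation each of the 64 sampled directions would be used for an intervention and the resulting ASR values aggregated per cone; how the paper aggregates them is not specified in this summary.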

Frameworks (2)

  • Concept Cones
    The central framework this paper extends from refusal to propositional truth; identifies multi-dimensional subspaces that causally mediate target behaviors
  • Linear Representation Hypothesis
    The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
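
For reference, the cone and the training objective described on this page can be written compactly. The symbols N, λ_i, and v_i are notation chosen here for illustration. The loss is shown as the plain sum named on this page; any weighting of the terms is not given here.

```latex
\mathcal{C}(v_1,\dots,v_N) = \Big\{ \sum_{i=1}^{N} \lambda_i v_i \;:\; \lambda_i \ge 0 \Big\},
\qquad v_i^\top v_j = \delta_{ij}

\mathcal{L} = \mathcal{L}_{\text{add}} + \mathcal{L}_{\text{ablate}} + \mathcal{L}_{\text{retain}}
```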

Original abstract

Large Language Models (LLMs) exhibit strong conversational abilities but often generate falsehoods. Prior work suggests that the truthfulness of simple propositions can be represented as a single linear direction in a model's internal activations, but this may not fully capture its underlying geometry. In this work, we extend the concept cone framework, recently introduced for modeling refusal, to the domain of truth. We identify multi-dimensional cones that causally mediate truth-related behavior across multiple LLM families. Our results are supported by three lines of evidence: (i) causal interventions reliably flip model responses to factual statements, (ii) learned cones generalize across model architectures, and (iii) cone-based interventions preserve unrelated model behavior. These findings reveal the richer, multidirectional structure governing simple true/false propositions in LLMs and highlight concept cones as a promising tool for probing abstract behaviors.
