claim
active
Truthful behavior in LLMs is not confined to a single linear axis; multiple orthogonal directions can independently mediate it
Central interpretive claim of the paper
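The claim has a simple geometric reading. If truth is mediated by several mutually orthogonal directions, then removing one of them from an activation leaves the activation's components along the others untouched. The pure-Python sketch below shows that geometry using directional ablation, which projects a direction out of an activation vector. It is illustrative only and makes several assumptions: the helpers (`gram_schmidt`, `ablate`), the 16-dimensional space and the random directions are hypothetical stand-ins rather than the paper's code or data, and ablation is assumed here as the test of mediation.

```python
import math
import random

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def gram_schmidt(vectors):
    """Orthonormalize linearly independent vectors (modified Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)  # remove the component along each earlier basis vector
            w = [wi - c * bi for wi, bi in zip(w, b)]
        basis.append(normalize(w))
    return basis

def ablate(x, d):
    """Directional ablation: remove the component of activation x along unit direction d."""
    c = dot(x, d)
    return [xi - c * di for xi, di in zip(x, d)]

# Hypothetical setup: three orthonormal "truth directions" in a 16-d residual stream.
random.seed(0)
dim = 16
dirs = gram_schmidt([[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)])

x = [random.gauss(0, 1) for _ in range(dim)]  # stand-in activation vector
x_ablated = ablate(x, dirs[0])                # ablate only the first direction

# Component along dirs[0] is gone; component along dirs[1] is unchanged.
print(abs(dot(x_ablated, dirs[0])) < 1e-9)
print(abs(dot(x_ablated, dirs[1]) - dot(x, dirs[1])) < 1e-9)
```

Because the directions are orthogonal, an intervention that targets one of them cannot detect or suppress the signal carried by the others. This is why a single-direction analysis can miss mediators that act independently.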
Source paper
extracted_from (2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Clayton Lau +4
Neighborhood: ranked by edge count
Papers (1)
paper
Findings (6)
finding
- Qwen-2.5-7B achieves 100% ASR across all cone dimensions 1–5 (associated_with, supports): Experiment 2 result showing large models can support high-dimensional truth cones
- Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
- Negative result from sentiment extension showing concept cones do not trivially generalize
- Experiment 2 result showing large Gemma model supports high-dimensional truth cones
- Experiment 1 finding localizing where truth can be causally mediated
- Core layer localization finding from Experiment 1
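The DIM finding can be illustrated with a small sketch. It rests on two assumptions. First, DIM is taken to mean the difference-in-means direction: the mean activation on true statements minus the mean on false ones. Second, a truth cone is treated as the set of nonnegative combinations of a few basis directions. Under those assumptions, the code measures how much of a direction drawn from the cone lies outside the DIM direction. All helper names, the 4-dimensional toy activations and the cone basis are hypothetical and are not taken from the paper.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def dim_direction(true_acts, false_acts):
    """Difference-in-means direction: mean(true) - mean(false), unit-normalized."""
    diff = [t - f for t, f in zip(mean(true_acts), mean(false_acts))]
    return normalize(diff)

def cone_direction(basis, weights):
    """Unit vector for a nonnegative combination of basis directions (a point in the cone)."""
    if any(w < 0 for w in weights):
        raise ValueError("cone membership requires nonnegative weights")
    combo = [sum(w * b[i] for w, b in zip(weights, basis)) for i in range(len(basis[0]))]
    return normalize(combo)

def residual_fraction(d, ref):
    """Share of unit vector d's squared norm lying outside unit direction ref."""
    return 1.0 - dot(d, ref) ** 2

# Toy 4-d activations (hypothetical): true and false statements differ only along axis 0.
true_acts = [[1.0, 0.5, 0.0, 0.0], [1.0, -0.5, 0.0, 0.0]]
false_acts = [[-1.0, 0.5, 0.0, 0.0], [-1.0, -0.5, 0.0, 0.0]]
dim_dir = dim_direction(true_acts, false_acts)

# Hypothetical 2-d cone spanned by axis 0 and axis 1.
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
d = cone_direction(basis, [1.0, 1.0])

print(dim_dir)                                   # → [1.0, 0.0, 0.0, 0.0]
print(round(residual_fraction(d, dim_dir), 6))   # → 0.5
```

In this toy setup, half of the cone direction's squared norm is invisible to the single DIM direction. This shows in miniature how one direction can cover only part of a multi-dimensional subspace. On real activations, the cone basis would come from the paper's own search procedure, which this sketch does not reproduce.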
Concepts (1)
concept
- Adversarial Manipulation of Truthfulness (associated_with): Risk that multiple truth directions enable attacks that shift outputs without triggering the primary truth direction
Claims (1)
claim
- Safety implication derived from multi-dimensional truth structure finding
Related by similarity (8)
cosine ≥ 0.65 · no typed edge. Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Central empirical conclusion of the paper about the fundamental limits of truth directions.
- Establishes that the observed linear structure is not merely a representation of text probability
- Core claim of ReflCtrl that a single direction captures and controls reflection
- Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023
- Where inside the LLM should we look for an accurate truth direction that will generalize the most across tasks? (question, cosine 0.795): One of the three guiding research questions of the paper.
- Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations
- Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
- Interpretive claim connecting scale to abstraction level in LLM representations