hypothesis

active

hypothesis:individual-cone-basis-vectors-may-correspond-to-interpretable-semantic-facets-of-truth-such-as-temporal-facts-geographic-facts-or-commonsense

Individual cone basis vectors may correspond to interpretable semantic facets of truth such as temporal facts, geographic facts, or commonsense

Future-work hypothesis: assign semantic meaning to the individual axes of the truth cone

Source paper

extracted_from

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

(2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Clayton Lau +4

Neighborhood — ranked by edge-count

Papers (1)

paper

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
introduces

Questions (1)

question

What semantic labels correspond to the individual basis vectors of the truth cone?
gates
Central open question for future work on interpretability of cone axes

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Given a true propositional input (e.g., 'Paris is the capital of France'), ablating along any basis vector of this cone disrupts the model's ability to generate a truthful response. · quote · 0.780
Load-bearing illustration of what a concept cone for truth means operationally
Concept cone truth interventions would generalize to larger frontier models and multimodal settings · hypothesis · 0.759
Key robustness question raised as future work
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.755
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
In Qwen-2.5-9B, only v1 has meaningful cosine similarity to DIM direction; all additional basis vectors have cosine similarities ~1e-9 · finding · 0.750
Appendix E replication of DIM alignment finding in Qwen model
Anthropic Interpretability Team: 171 emotion vectors causally influence behavior; performing versus having a functional emotion representation is measurably different · finding · 0.749
Cited as activation-level support for the performing care vs having care distinction the battery detects behaviorally
There is a many-to-many mapping between neurons and concepts, meaning multiple high-level causal variables might be encoded in overlapping groups of neurons · claim · 0.749
Fundamental theoretical claim motivating DAS, attributed to Smolensky/Rumelhart/McClelland.
Superposition hypothesis: neural networks represent more features than dimensions using almost-orthogonal directions. · hypothesis · 0.746
Explanation for why dictionary learning can recover many more features than dimensions.
Truth may be linearly separable in the model's representation space, but the structure is richer than a single linear axis · claim · 0.745
Interpretive synthesis of DIM and cone intervention successes
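
Illustrative sketch (not taken from the source paper): several entries above refer to ablating along a cone basis vector and to comparing basis vectors with the DIM direction by cosine similarity. The NumPy snippet below shows what those two operations could look like on toy data. The hidden states, the basis, and the names ablate_direction, cosine, and dim_direction are all hypothetical, and the basis is assumed to be orthonormal, which the paper may or may not require.

```python
import numpy as np

def ablate_direction(h, v):
    """Remove the component of each row of h along direction v (projection ablation)."""
    v_hat = v / np.linalg.norm(v)
    return h - np.outer(h @ v_hat, v_hat)

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d = 16                                  # toy hidden size (assumption)
h = rng.normal(size=(4, d))             # hypothetical hidden states, one row per token

# Hypothetical cone basis: 3 orthonormal vectors obtained from a QR decomposition.
q, _ = np.linalg.qr(rng.normal(size=(d, 3)))
basis = q.T                             # shape (3, d)

# Toy stand-in for a difference-in-means (DIM) direction, aligned with the first axis.
dim_direction = basis[0].copy()

# Ablate along each basis vector in turn and record the residual projection.
residuals = [np.abs(ablate_direction(h, v) @ v).max() for v in basis]

# Cosine similarity of each basis vector to the stand-in DIM direction.
similarities = [cosine(v, dim_direction) for v in basis]
```

In this toy setup the residual projections are numerically zero after ablation, and only the first basis vector has a non-negligible cosine similarity to dim_direction, because the remaining vectors are orthogonal to it by construction. Labeling each axis with a semantic facet, as the hypothesis proposes, would additionally require comparing the behavioral effect of each ablation across categories of facts, which this sketch does not attempt.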