concept
active
concept:truth-subspace
Truth Subspace
The multi-dimensional activation subspace whose directions causally mediate truthful behavior in LLMs
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Concept Cones · introduces · The central framework this paper extends from refusal to propositional truth; identifies multi-dimensional subspaces that causally mediate target behaviors
Concepts (2)
concept
- Truth Direction · extends · A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements
- Orthonormal Basis Vectors · associated_with · The set of mutually orthogonal unit vectors that span the concept cone, each independently causally mediating target behavior
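As an illustration of what it means for a subspace to be spanned by orthonormal basis vectors, the sketch below orthonormalizes a few direction vectors and removes their span from a batch of activations by linear projection. This is a generic projection-ablation example, not necessarily the procedure used in the source paper. It handles only the linear span of the directions, not the cone structure itself, and the dimensions, random data, and function names are hypothetical placeholders.

```python
import numpy as np


def orthonormalize(directions: np.ndarray) -> np.ndarray:
    """Return orthonormal rows spanning the same space as `directions`.

    Assumes the k input rows are linearly independent and k <= d_model.
    """
    # Reduced QR of the transpose gives orthonormal columns with the same span.
    q, _ = np.linalg.qr(directions.T)
    return q.T


def project_onto_subspace(acts: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Component of each activation row that lies inside the subspace."""
    return acts @ basis.T @ basis


def ablate_subspace(acts: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove the subspace component, leaving the orthogonal complement."""
    return acts - project_onto_subspace(acts, basis)


# Hypothetical stand-ins: random directions and activations, not real model data.
rng = np.random.default_rng(0)
d_model, k = 16, 3
raw_directions = rng.normal(size=(k, d_model))
basis = orthonormalize(raw_directions)

acts = rng.normal(size=(5, d_model))
ablated = ablate_subspace(acts, basis)
```

After ablation, every activation has a numerically zero coefficient along each basis vector, which is the property an intervention on the whole subspace relies on.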
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Burger et al. (2024) framework proposing that truth is linearly decoded along a 2D subspace capturing both polarity-dependent and polarity-invariant directions.
- A vector subspace that causally impacts outputs only through the sign of its values, enabling harmless magnitude divergence
- How can we discover a maximally informative or interpretable truth subspace rather than just a sufficient one? · question · 0.789 · Limitation-driven open question about subspace optimality
- The subspace of activation space spanned by the 171 orthogonalized emotion probe vectors, used to measure SAE feature emotional alignment
- Extension of DAS that learns a second rotation matrix on top of a fixed first one to decompose representations into sub-representations.
- Subspaces whose contributions to a layer's output are canceled by opposing weight values, making them non-causally active under natural inputs
- Load-bearing interpretive claim about the layer-specificity of Burger et al.'s finding.
- Investigation of whether a distributed representation can be further decomposed into sub-representations encoding component identities.
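The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter over entity embeddings. How the page actually computes its embeddings is not stated, so the vectors, entity ids, and the `similar_untyped` helper below are hypothetical; only the 0.65 threshold comes from the page.

```python
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def similar_untyped(query_vec, candidates, typed_ids, threshold=0.65):
    """Return (id, score) pairs above the threshold that have no typed edge.

    `candidates` maps entity id -> embedding; `typed_ids` holds ids that
    already have a typed relation to the query entity and are skipped.
    """
    hits = []
    for entity_id, vec in candidates.items():
        if entity_id in typed_ids:
            continue
        score = cosine(query_vec, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    # Highest similarity first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy 2-D embeddings, chosen by hand for illustration only.
query = np.array([1.0, 0.0])
candidates = {
    "a": np.array([1.0, 0.0]),  # identical, but already has a typed edge
    "b": np.array([0.0, 1.0]),  # orthogonal
    "c": np.array([1.0, 1.0]),  # 45 degrees from the query
    "d": np.array([0.6, 0.8]),  # cosine 0.6, just under the threshold
}
neighbors = similar_untyped(query, candidates, typed_ids={"a"})
```

With these toy vectors only `c` passes both conditions: `a` is excluded by its typed edge, while `b` and `d` fall below the threshold.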