Truth Subspace

The multi-dimensional activation subspace whose directions causally mediate truthful behavior in LLMs

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Concept Cones
introduces
The central framework this paper extends from refusal to propositional truth; identifies multi-dimensional subspaces that causally mediate target behaviors

Concepts (2)

concept

Truth Direction
extends
A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements
Orthonormal Basis Vectors
associated_with
The set of mutually orthogonal unit vectors that span the concept cone, each independently causally mediating target behavior
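
The relationship between a set of candidate directions and an orthonormal basis for the subspace they span can be sketched as follows. This is a minimal illustration, not the method of any cited work: the toy vectors, the function names `orthonormalize` and `ablate`, and the choice of Gram-Schmidt plus projection-based ablation are all assumptions made for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors, tol=1e-10):
    """Gram-Schmidt: return mutually orthogonal unit vectors spanning the inputs."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors that already lie in the span
            basis.append([wi / norm for wi in w])
    return basis

def ablate(activation, basis):
    """Remove the component of `activation` that lies in span(basis).

    Assumes `basis` is orthonormal, so each coefficient is a plain dot product.
    """
    out = list(activation)
    for b in basis:
        c = dot(activation, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out

# Hypothetical toy directions in a 4-dimensional activation space.
directions = [[1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0]]
basis = orthonormalize(directions)

activation = [0.5, -1.0, 2.0, 3.0]
ablated = ablate(activation, basis)
```

After ablation, the activation has no component along any basis vector, which is one common way to test whether a subspace causally mediates a behavior: if the behavior changes when the subspace is projected out, the subspace is implicated.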

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Two-dimensional truth subspace · framework · 0.904
Burger et al. (2024) framework proposing that truth is linearly decoded along a 2D subspace capturing both polarity-dependent and polarity-invariant directions.
Behaviorally Binary Subspace · concept · 0.798
A vector subspace that causally impacts outputs only through the sign of its values, enabling harmless magnitude divergence
How can we discover a maximally informative or interpretable truth subspace rather than just a sufficient one? · question · 0.789
Limitation-driven open question about subspace optimality
Emotion Subspace · concept · 0.789
The subspace of activation space spanned by the 171 orthogonalized emotion probe vectors, used to measure SAE feature emotional alignment
Subspace DAS · method · 0.787
Extension of DAS that learns a second rotation matrix on top of a fixed first one to decompose representations into sub-representations.
Balanced Subspaces · concept · 0.767
Subspaces whose contributions to a layer's output are canceled by opposing weight values, making them non-causally active under natural inputs
The two-dimensional subspace reported by Burger et al. (2024) seems to reflect a stage of transition in the model's processing, rather than a universal property of truth directions. · quote · 0.757
Load-bearing interpretive claim about the layer-specificity of Burger et al.'s finding.
Subspace Decomposition of Representations · concept · 0.747
Investigation of whether a distributed representation can be further decomposed into sub-representations encoding component identities.
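
The selection rule for this list (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter over entity embeddings. This is only an illustration under stated assumptions: the embedding vectors, entity names, and the function `similar_untyped` are hypothetical, and the page does not say how its embeddings are actually computed or stored.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def similar_untyped(target, candidates, typed_names, threshold=0.65):
    """Return (name, score) pairs at or above `threshold`, excluding entities
    that already have a typed edge to the target, sorted by descending score."""
    hits = [
        (name, cosine(target, vec))
        for name, vec in candidates.items()
        if name not in typed_names
    ]
    hits = [(name, score) for name, score in hits if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical 2-dimensional embeddings, for illustration only.
target_vec = [1.0, 0.0]
candidates = {
    "Entity A": [1.0, 0.1],   # close to the target
    "Entity B": [0.0, 1.0],   # orthogonal to the target
    "Entity C": [1.0, 0.0],   # identical, but already linked by a typed edge
}
neighbors = similar_untyped(target_vec, candidates, typed_names={"Entity C"})
```

In this toy example only "Entity A" survives the filter: "Entity B" falls below the threshold, and "Entity C" is excluded because it already has a typed relation.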