Balanced Subspaces

Subspaces whose contributions to a layer's output are canceled by opposing weight values, leaving them active but without causal effect on the output under natural inputs
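The cancellation described above can be illustrated with a toy linear readout. This is a minimal sketch under stated assumptions, not an implementation from any cited work: the three-dimensional layer, the weight values, the `natural_input` distribution, and the use of the output's sign as a stand-in for a decision boundary are all hypothetical choices made for illustration.

```python
import random


def layer_output(x, w):
    """Linear readout: dot product of activations x and weights w."""
    return sum(xi * wi for xi, wi in zip(x, w))


# Hypothetical 3-d layer: dims 0 and 1 form a balanced pair (weights +2 and -2);
# dim 2 carries the signal that the output actually depends on.
weights = [2.0, -2.0, 1.0]


def natural_input(rng):
    """Assumed 'natural' distribution: dims 0 and 1 always carry the same value."""
    shared = rng.uniform(-5.0, 5.0)
    signal = rng.choice([-1.0, 1.0])
    return [shared, shared, signal]


rng = random.Random(0)
samples = [natural_input(rng) for _ in range(1000)]

# Under natural inputs the balanced pair's contributions cancel exactly,
# so the output is determined by the signal dimension alone.
max_balanced_contribution = max(
    abs(s[0] * weights[0] + s[1] * weights[1]) for s in samples
)

# Intervening on dim 0 alone (holding the others fixed) breaks the cancellation
# and flips the sign of the output, even though dim 2 is unchanged.
base = [1.0, 1.0, 1.0]
patched = [-3.0, base[1], base[2]]
natural_out = layer_output(base, weights)     # 2 - 2 + 1
patched_out = layer_output(patched, weights)  # -6 - 2 + 1
```

In this sketch the balanced pair is active (its values vary widely across samples) yet never changes the output until one of its dimensions is set independently of the other, which is a state the assumed natural distribution never produces.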

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Subspace DAS · method · 0.777
Extension of DAS that learns a second rotation matrix on top of a fixed first one to decompose representations into sub-representations.
Behaviorally Binary Subspace · concept · 0.771
A vector subspace that causally impacts outputs only through the sign of its values, enabling harmless magnitude divergence
Truth Subspace · concept · 0.767
The multi-dimensional activation subspace whose directions causally mediate truthful behavior in LLMs
Emotion Subspace · concept · 0.753
The subspace of activation space spanned by the 171 orthogonalized emotion probe vectors, used to measure SAE feature emotional alignment
Subspace Decomposition of Representations · concept · 0.753
Investigation of whether a distributed representation can be further decomposed into sub-representations encoding component identities.
Subspace Intervention · concept · 0.748
Intervention targeting specific dimensional subsets of activation vectors rather than full representations
Two-dimensional truth subspace · framework · 0.745
Burger et al. (2024) framework proposing that truth is linearly decoded along a 2D subspace capturing both polarity-dependent and polarity-invariant directions.
Intervention on a balanced subspace dimension while holding others fixed crosses the decision boundary using a non-native mechanism · finding · 0.736
Additional synthetic example of pernicious divergence from balanced subspaces