concept
active
concept:difference-in-means-direction
Difference-in-Means Direction
Vector from the mean of false representations to the mean of true representations; the core of mass-mean probing.
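The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the function name `difference_in_means_direction`, the array layout (one activation vector per row), and the toy data are all assumptions made for the example.

```python
import numpy as np

def difference_in_means_direction(acts, labels):
    """Return mean(true activations) - mean(false activations).

    acts   : (n_samples, d_model) array of representations (assumed layout)
    labels : (n_samples,) boolean array, True for true statements
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    mu_true = acts[labels].mean(axis=0)    # mean of true representations
    mu_false = acts[~labels].mean(axis=0)  # mean of false representations
    return mu_true - mu_false              # points from false mean to true mean

# Hypothetical toy activations in a 2-dimensional space.
acts = np.array([[1.0, 0.0], [3.0, 0.0], [-1.0, 0.0], [-3.0, 0.0]])
labels = np.array([True, True, False, False])
theta = difference_in_means_direction(acts, labels)
```

No optimization is involved: the direction is fully determined by the two class means.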
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Mass-Mean Probing · implements · Introduced in this paper: an optimization-free probing technique using difference-in-means direction with optional covariance correction
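A rough sketch of how such a probe might look, under stated assumptions: the covariance correction is taken here to mean multiplying the difference-in-means direction by the (pseudo-)inverse of the covariance of class-centered activations, and the bias is placed at the midpoint of the two class means. Both choices, and the names `mass_mean_probe` and `predict_proba`, are illustrative assumptions rather than details confirmed by this entry.

```python
import numpy as np

def mass_mean_probe(acts, labels, use_covariance=True):
    """Fit a probe from class means only (no gradient-based optimization).

    Returns (w, b) such that sigmoid(x @ w + b) scores a representation x.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    mu_true = acts[labels].mean(axis=0)
    mu_false = acts[~labels].mean(axis=0)
    theta = mu_true - mu_false  # difference-in-means direction

    if use_covariance:
        # Assumption: covariance of activations after subtracting each
        # sample's own class mean.
        centered = np.where(labels[:, None], acts - mu_true, acts - mu_false)
        sigma = centered.T @ centered / len(acts)
        w = np.linalg.pinv(sigma) @ theta  # pinv tolerates singular sigma
    else:
        w = theta

    # Assumption: decision threshold at the midpoint of the two class means.
    b = -w @ (mu_true + mu_false) / 2.0
    return w, b

def predict_proba(x, w, b):
    """Logistic score for one vector or a batch of row vectors."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(x, dtype=float) @ w + b)))

# Hypothetical toy data: classes separated along the first axis.
X = np.array([[1.0, 0.5], [3.0, -0.5], [-1.0, 0.5], [-3.0, -0.5]])
y = np.array([True, True, False, False])
w, b = mass_mean_probe(X, y)
probs = predict_proba(X, w, b)
```

With `use_covariance=False` the probe reduces to projecting onto the plain difference-in-means direction.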
Concepts (1)
concept
- Truth Direction · associated_with · A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Method for extracting linear directions by subtracting mean activations of contrastive groups; used to define the Assistant Axis
- Subtle variation and detail, as in pots of flowers, that brings life to a place.
- A direction in the model's representation space that governs self-reflection behavior, computed as mean difference between reflection and non-reflection embeddings
- Formal consequence of Belrose et al. (2023) Theorem G.1 connecting mass-mean probing to optimal linear concept erasure
- A straight vector in activation space, traditionally used for concept manipulation; claimed to be insufficient when true concept geometry is curved.
- Arditi et al. 2024 finding that refusal behavior is mediated by one direction in LLM activations; exemplar of single-direction causal results
- Paradigm of finding the right direction in activation space (e.g., linear steering).
- Proposed universal invariant of cognition and intelligence—capacity for goal-directed activity in a problem space, independent of substrate or embodiment.