concept
active
concept:truth-direction · Truth Direction
A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements
Neighborhood — ranked by edge-count
Papers (2)
paper
Methods (2)
method
- Linear Probe · implements · Simple linear classifiers trained on model activations, used as the probing technique within the introduced method.
- Difference-in-Means · associated_with · Method for extracting linear directions by subtracting the mean activations of contrastive groups; used to define the Assistant Axis.
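The Difference-in-Means entry above can be illustrated with a minimal sketch. It assumes synthetic activations rather than activations from a real model: the hidden "truth" axis, the noise level, and the helper names (`mean_vector`, `diff_in_means`, `project`, `sample`) are all hypothetical choices made for illustration. The direction is the mean of the true-statement activations minus the mean of the false-statement activations, and a statement is classified by projecting its activation onto that direction.

```python
import random
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [fmean(col) for col in zip(*rows)]


def diff_in_means(true_acts, false_acts):
    """Vector from the mean of false activations to the mean of true ones."""
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    return [t - f for t, f in zip(mu_true, mu_false)]


def project(vec, direction):
    """Dot product of an activation with the direction."""
    return sum(v * d for v, d in zip(vec, direction))


# Synthetic stand-in for model activations (assumption: not real model data).
rng = random.Random(0)
dim = 8
hidden_axis = [1.0 if i == 2 else 0.0 for i in range(dim)]  # assumed truth axis


def sample(is_true, n):
    """Draw n noisy activations shifted along +/- the assumed truth axis."""
    sign = 1.0 if is_true else -1.0
    return [[sign * 2.0 * h + rng.gauss(0.0, 0.5) for h in hidden_axis]
            for _ in range(n)]


true_acts = sample(True, 200)
false_acts = sample(False, 200)

direction = diff_in_means(true_acts, false_acts)

# Threshold at the projection of the midpoint between the two class means.
threshold = project(mean_vector(true_acts + false_acts), direction)


def predict(vec):
    """True if the activation falls on the 'true' side of the midpoint."""
    return project(vec, direction) > threshold


correct = (sum(predict(v) for v in true_acts)
           + sum(not predict(v) for v in false_acts))
accuracy = correct / (len(true_acts) + len(false_acts))
largest_component = max(range(dim), key=lambda i: abs(direction[i]))
```

On this toy data the largest component of `direction` should coincide with the assumed axis. With real activations, `true_acts` and `false_acts` would instead be hidden states collected at a chosen layer for labeled true and false statements, and a trained linear probe (for example, logistic regression) may recover a somewhat different direction than the difference of means.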
Concepts (6)
concept
- Truth direction universality · associated_with, related_to · The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.
- Truth direction in LLMs · related_to · Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023.
- Linear representation · associated_with · The idea that features are encoded as directions in activation space.
- Truth Subspace · extends · The multi-dimensional activation subspace whose directions causally mediate truthful behavior in LLMs.
- Difference-in-Means Direction · associated_with · Vector from the mean of false representations to the mean of true representations; core of mass-mean probing.
- Monotonic Scaling Property · associated_with · Property of truth directions: the probability of a truthful response scales monotonically with the strength of the activation addition coefficient.
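The Monotonic Scaling Property entry refers to activation addition, in which a hidden state is shifted by a coefficient times a direction. The following is a toy sketch only: the hidden state, the direction, and the logistic readout standing in for the rest of the model are all assumed values, and the names `add_direction` and `toy_truthful_prob` are hypothetical. In a real model, monotonic scaling is an empirical property to be measured, not something this sketch demonstrates.

```python
import math


def add_direction(hidden, direction, alpha):
    """Activation addition: shift a hidden state by alpha times a direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def toy_truthful_prob(hidden, readout, bias=0.0):
    """Hypothetical logistic readout standing in for the rest of the model."""
    logit = sum(h * w for h, w in zip(hidden, readout)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# All values below are assumptions chosen for illustration.
hidden = [0.2, -0.4, 0.1, 0.3]
truth_direction = [0.0, 1.0, 0.0, 0.5]
readout = [0.1, 0.8, -0.2, 0.4]  # has positive overlap with truth_direction

alphas = [-2.0, -1.0, 0.0, 1.0, 2.0]
probs = [
    toy_truthful_prob(add_direction(hidden, truth_direction, a), readout)
    for a in alphas
]

# Check that the toy probability strictly increases with the coefficient.
is_monotonic = all(p < q for p, q in zip(probs, probs[1:]))
```

In this toy setting the probability rises with `alpha` because the readout has a positive dot product with the direction. Testing the property on an actual LLM would mean sweeping `alpha`, adding the scaled direction at a chosen layer, and measuring how often the model's response is truthful.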
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- The case where two datasets (e.g., larger_than and smaller_than) separate along opposite directions in PCA, indicating a shared feature with opposite sign
- A direction in the model's representation space that governs self-reflection behavior, computed as mean difference between reflection and non-reflection embeddings
- Arditi et al. 2024 finding that refusal behavior is mediated by one direction in LLM activations; exemplar of single-direction causal results
- Truth directions emerge in earlier layers for factual tasks and later layers for arithmetic tasks. · claim · 0.795 · Core empirical claim about the layer-dependence of truth direction emergence as a function of task type.
- Interpretation of KL divergence retention results
- A specific direction in an LLM's residual stream that encodes the truth or falsehood of factual statements
- Correctness of input statements to an LLM, as opposed to output-truth (correctness of model-generated outputs).
- Research question motivating Section 5.