Truth Direction

A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements

Neighborhood — ranked by edge-count

paper

method

Linear Probe
implements
Simple linear classifiers trained on model activations used as the probing technique within the introduced method.
Difference-in-Means
associated_with
Method for extracting linear directions by subtracting mean activations of contrastive groups; used to define the Assistant Axis
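
A minimal sketch of the difference-in-means idea described above, assuming activations are already available as NumPy arrays. The synthetic data, the planted axis, and the helper name `difference_in_means` are illustrative assumptions, not part of any cited method's code. Unlike a trained linear probe, this fits no parameters; it only subtracts two group means.

```python
import numpy as np

def difference_in_means(acts_true, acts_false):
    """Unit vector pointing from the mean of the false group to the mean of the true group."""
    direction = acts_true.mean(axis=0) - acts_false.mean(axis=0)
    return direction / np.linalg.norm(direction)

# Synthetic stand-in for model activations (assumption: 16-dim, truth planted on axis 0).
rng = np.random.default_rng(0)
d = 16
planted = np.zeros(d)
planted[0] = 1.0
acts_true = rng.normal(size=(200, d)) + 2.0 * planted
acts_false = rng.normal(size=(200, d)) - 2.0 * planted

direction = difference_in_means(acts_true, acts_false)

# Classify by projecting onto the direction and thresholding at the midpoint of the two means.
midpoint = 0.5 * (acts_true.mean(axis=0) + acts_false.mean(axis=0)) @ direction
pred_true = acts_true @ direction > midpoint
pred_false = acts_false @ direction > midpoint
accuracy = (pred_true.sum() + (~pred_false).sum()) / (len(acts_true) + len(acts_false))
```

On real activations the two groups would come from contrastive prompts (for example true versus false statements) read at a chosen layer; that extraction step is outside this sketch.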

concept

Truth direction universality
associated_with, related_to
The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.
Truth direction in LLMs
related_to
Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023
Linear representation
associated_with
The idea that features are encoded as directions in activation space.
Truth Subspace
extends
The multi-dimensional activation subspace whose directions causally mediate truthful behavior in LLMs
Difference-in-Means Direction
associated_with
Vector from mean of false representations to mean of true representations; core of mass-mean probing
Monotonic Scaling Property
associated_with
Property of truth directions: probability of truthful response scales monotonically with the strength of the activation addition coefficient
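
The monotonic scaling property can be illustrated with a toy model. The sketch below assumes a hypothetical logistic readout standing in for the model's probability of a truthful response; in a real LLM the relationship is an empirical finding, not a guarantee. The names `steer` and `toy_truthful_prob` are illustrative assumptions.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Activation addition: shift a hidden state along a direction by coefficient alpha."""
    return hidden + alpha * direction

def toy_truthful_prob(hidden, readout_w):
    """Hypothetical logistic readout standing in for P(truthful response)."""
    return 1.0 / (1.0 + np.exp(-(hidden @ readout_w)))

rng = np.random.default_rng(1)
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

# Assumption: the readout is positively aligned with the steering direction.
readout_w = direction + 0.1 * rng.normal(size=d)
hidden = rng.normal(size=d)

alphas = np.linspace(-4.0, 4.0, 9)
probs = np.array([toy_truthful_prob(steer(hidden, direction, a), readout_w) for a in alphas])

# In this toy setup the probability rises strictly with alpha.
is_monotonic = bool(np.all(np.diff(probs) > 0))
```

In this toy case monotonicity follows directly from the positive inner product between the readout and the direction. Checking the property on a real model would mean sweeping the coefficient and measuring response probabilities empirically.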

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Antipodal Alignment of Truth Directions · concept · 0.845
The case where two datasets (e.g., larger_than and smaller_than) separate along opposite directions in PCA, indicating a shared feature with opposite sign
Reflection direction · concept · 0.819
A direction in the model's representation space that governs self-reflection behavior, computed as mean difference between reflection and non-reflection embeddings
Refusal direction · concept · 0.811
Arditi et al. 2024 finding that refusal behavior is mediated by one direction in LLM activations; exemplar of single-direction causal results
Truth directions emerge in earlier layers for factual tasks and later layers for arithmetic tasks. · claim · 0.795
Core empirical claim about the layer-dependence of truth direction emergence as a function of task type.
Discovered truth directions are highly specific and do not interfere with general instruction-following behavior · claim · 0.784
Interpretation of KL divergence retention results
Truth Direction in LLM Latent Space · concept · 0.781
A specific direction in an LLM's residual stream that encodes the truth or falsehood of factual statements
Input-truth · concept · 0.780
Correctness of input statements to an LLM, as opposed to output-truth (correctness of model-generated outputs).
What is the effect of model instructions on truth directions? · question · 0.775
Research question motivating Section 5.
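
The antipodal alignment candidate listed above can be checked numerically. The entry describes the effect in PCA; this sketch instead compares difference-in-means directions by cosine similarity, a simpler stand-in. The two synthetic datasets and the shared axis are assumptions for illustration only.

```python
import numpy as np

def dim_direction(acts_true, acts_false):
    """Unit difference-in-means direction from the false group to the true group."""
    v = acts_true.mean(axis=0) - acts_false.mean(axis=0)
    return v / np.linalg.norm(v)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(2)
d = 16
shared = np.zeros(d)
shared[0] = 1.0  # assumption: both datasets vary along one shared feature axis

def make_dataset(sign, n=200):
    """Synthetic activations where 'true' sits at sign * shared and 'false' at the opposite end."""
    acts_true = rng.normal(size=(n, d)) + sign * 2.0 * shared
    acts_false = rng.normal(size=(n, d)) - sign * 2.0 * shared
    return acts_true, acts_false

dir_a = dim_direction(*make_dataset(+1))  # e.g. a larger_than-style dataset
dir_b = dim_direction(*make_dataset(-1))  # e.g. a smaller_than-style dataset

# A cosine near -1 indicates the same feature axis used with opposite sign.
similarity = cosine(dir_a, dir_b)
```

A cosine near +1 would instead suggest the two datasets share a direction with the same sign, and a value near 0 would suggest unrelated directions.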