concept
active
concept:truth-direction

Truth Direction

A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements

Neighborhood — ranked by edge-count

Methods (2)

method
  • Linear Probe
    implements
    Simple linear classifiers trained on model activations, used as the probing technique within the introduced method.
  • Difference-in-Means
    associated_with
    Method for extracting linear directions by subtracting mean activations of contrastive groups; used to define the Assistant Axis
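
A minimal sketch of the difference-in-means idea described above, assuming activations are available as plain lists of floats. The toy vectors and the helper names (`mean_vector`, `difference_in_means`, `project`) are hypothetical and only illustrate the arithmetic; they are not taken from any particular paper or library.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_in_means(true_acts: Sequence[Sequence[float]],
                        false_acts: Sequence[Sequence[float]]) -> Vector:
    """Vector pointing from the mean 'false' activation to the mean 'true' one."""
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    return [t - f for t, f in zip(mu_true, mu_false)]


def project(x: Sequence[float], direction: Sequence[float]) -> float:
    """Dot product of an activation with the candidate direction."""
    return sum(a * b for a, b in zip(x, direction))


# Made-up 3-dimensional "activations" for true and false statements.
true_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
false_acts = [[-2.0, 0.0, 1.0], [-4.0, 0.0, 1.0]]

direction = difference_in_means(true_acts, false_acts)
scores_true = [project(x, direction) for x in true_acts]
scores_false = [project(x, direction) for x in false_acts]
```

In this toy setup the two groups differ only along the first coordinate, so the extracted direction is nonzero only there, and projecting onto it separates the groups. Real activations are high-dimensional and noisy, so the separation is generally much less clean.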

Concepts (6)

concept
  • Truth direction universality
    associated_with, related_to
    The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.
  • Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023
  • Linear representation
    associated_with
    The idea that features are encoded as directions in activation space.
  • The multi-dimensional activation subspace whose directions causally mediate truthful behavior in LLMs
  • Vector from mean of false representations to mean of true representations; core of mass-mean probing
  • Property of truth directions: the probability of a truthful response scales monotonically with the activation addition coefficient
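
The activation-addition property in the last entry can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the hidden state, the direction, and especially the logistic `toy_truthful_prob` readout, which stands in for the rest of a model and is not how any real LLM computes its output.

```python
import math
from typing import List, Sequence


def add_direction(hidden: Sequence[float],
                  direction: Sequence[float],
                  alpha: float) -> List[float]:
    """Activation addition: shift the hidden state by alpha * direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def toy_truthful_prob(hidden: Sequence[float],
                      readout: Sequence[float]) -> float:
    """Hypothetical logistic readout mapping a hidden state to a probability."""
    z = sum(h * w for h, w in zip(hidden, readout))
    return 1.0 / (1.0 + math.exp(-z))


# Made-up values; the readout has a positive component along the direction.
hidden = [0.5, -1.0, 0.25]
truth_direction = [1.0, 0.0, 0.0]
readout = [0.8, 0.1, -0.3]

alphas = [-2.0, -1.0, 0.0, 1.0, 2.0]
probs = [
    toy_truthful_prob(add_direction(hidden, truth_direction, a), readout)
    for a in alphas
]

for a, p in zip(alphas, probs):
    print(f"alpha={a:+.1f}  toy P(truthful)={p:.3f}")
```

In this toy model the probability rises with the coefficient only because the readout is linear in the hidden state and has a positive dot product with the direction. Whether a real model behaves this way is an empirical question that the entry above describes as a property of truth directions.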

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.