concept
active
concept:truth-direction-in-llms

Truth direction in LLMs

Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023
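As a minimal sketch of what "linear direction" means in practice, the Python below plants a hypothetical direction in synthetic activations. It estimates that direction as the difference between the mean activations for true and false statements, then reads truth off the scalar projection onto it. The difference-of-means estimator, the synthetic data, and every name here are illustrative assumptions. This is not the procedure of Burns et al. or of Azaria & Mitchell, and it evaluates in-sample for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 64, 500

# Hypothetical ground-truth direction planted in synthetic "activations".
planted = rng.normal(size=dim)
planted /= np.linalg.norm(planted)

labels = rng.integers(0, 2, size=n)  # 1 = true statement, 0 = false statement
acts = 2.0 * (2 * labels - 1)[:, None] * planted + rng.normal(size=(n, dim))

# Candidate truth direction: difference of class means, normalized to unit length.
direction = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
direction /= np.linalg.norm(direction)

# Read truth off the scalar projection, thresholded midway between the class means.
scores = acts @ direction
threshold = 0.5 * (scores[labels == 1].mean() + scores[labels == 0].mean())
preds = (scores > threshold).astype(int)

accuracy = float(np.mean(preds == labels))
alignment = float(direction @ planted)  # cosine similarity, since both are unit vectors
print(f"accuracy={accuracy:.3f}  cosine to planted direction={alignment:.3f}")
```

Everything the readout needs is a single vector and a threshold. That is the sense in which the truth signal is "linear": it is a projection, not a nonlinear function of the activations.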

Neighborhood — ranked by edge-count

Thinkers (1)

thinker
  • Collin Burns
    introduces
    Discovered truth directions in LLMs without supervision; cited for truth probe methodology
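Burns et al.'s unsupervised approach, contrast-consistent search (CCS), fits a probe on paired activations for the same statement answered both ways (for example "... Yes" and "... No"). The fit asks the two probabilities to be consistent (p+ close to 1 - p-) and confident (not both near 0.5). The sketch below implements that objective with plain NumPy gradient descent on synthetic contrast pairs. The data generator, hyperparameters, restart loop, and function names are assumptions. The paper's per-set normalization is simplified here to mean-centering each set.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def ccs_loss_and_grads(w, b, xp, xn):
    """CCS objective: mean of (p+ - (1 - p-))^2 + min(p+, p-)^2, with its gradients."""
    pp = sigmoid(xp @ w + b)  # probe output on the "... Yes" variant
    pn = sigmoid(xn @ w + b)  # probe output on the "... No" variant
    r = pp + pn - 1.0  # consistency residual: p+ should equal 1 - p-
    loss = np.mean(r ** 2 + np.minimum(pp, pn) ** 2)
    # dLoss/dp per example; the confidence term only flows to the smaller probability
    dpp = 2 * r + np.where(pp <= pn, 2 * pp, 0.0)
    dpn = 2 * r + np.where(pn < pp, 2 * pn, 0.0)
    gp = dpp * pp * (1 - pp)  # chain rule through the sigmoid
    gn = dpn * pn * (1 - pn)
    n = len(xp)
    return loss, (xp.T @ gp + xn.T @ gn) / n, (gp.sum() + gn.sum()) / n


def train_ccs(xp, xn, steps=1000, lr=0.5, restarts=5, seed=0):
    """Plain gradient descent from several random inits; keep the lowest-loss probe."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(restarts):
        w, b = rng.normal(scale=0.01, size=xp.shape[1]), 0.0
        for _ in range(steps):
            _, gw, gb = ccs_loss_and_grads(w, b, xp, xn)
            w, b = w - lr * gw, b - lr * gb
        loss = ccs_loss_and_grads(w, b, xp, xn)[0]
        if best is None or loss < best[0]:
            best = (loss, w, b)
    return best


# Hypothetical synthetic contrast pairs: each question answered "Yes" and "No".
rng = np.random.default_rng(1)
n, dim = 400, 32
truth = rng.normal(size=dim)
truth /= np.linalg.norm(truth)  # planted truth direction
y = rng.integers(0, 2, size=n)  # hidden labels; never shown to CCS
s = (2 * y - 1)[:, None]
token = rng.normal(size=dim)  # constant "Yes"/"No" answer-token artifact
xp = 3 * s * truth + token + rng.normal(size=(n, dim))
xn = -3 * s * truth - token + rng.normal(size=(n, dim))

# Center each set separately so the probe cannot simply read the answer token.
xp, xn = xp - xp.mean(axis=0), xn - xn.mean(axis=0)

loss, w, b = train_ccs(xp, xn)
p_true = 0.5 * (sigmoid(xp @ w + b) + 1 - sigmoid(xn @ w + b))
acc = float(np.mean((p_true > 0.5) == y))
acc = max(acc, 1 - acc)  # the objective fixes the direction only up to sign
cos = float(abs(w @ truth) / np.linalg.norm(w))
print(f"loss={loss:.4f}  accuracy={acc:.3f}  |cos| to planted direction={cos:.3f}")
```

No labels enter the objective, so nothing in it says which end of the direction is "true". That is why accuracy is reported as max(acc, 1 - acc) and alignment as an absolute cosine. The restart loop keeps the lowest-loss probe as a simple guard against poor initializations.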

Methods (1)

method
  • Linear Probe
    implements
    Simple linear classifiers trained on model activations; used as the probing technique within the introduced method.
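In its simplest supervised form, a linear probe is a logistic regression on activations. The sketch below fits one by gradient descent on hypothetical synthetic activations containing a planted truth direction. It then checks held-out accuracy and how closely the learned weight vector aligns with that direction. The data, learning rate, L2 penalty, and the name `train_linear_probe` are assumptions chosen for illustration, not values from the cited work.

```python
import numpy as np


def train_linear_probe(acts, labels, lr=0.1, steps=500, l2=1e-3):
    """Logistic-regression probe p(true | h) = sigmoid(w . h + b), fit by gradient descent."""
    n, dim = acts.shape
    w, b = np.zeros(dim), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        err = p - labels  # gradient of mean cross-entropy w.r.t. the logits
        w = w - lr * (acts.T @ err / n + l2 * w)
        b = b - lr * err.mean()
    return w, b


# Hypothetical activations: a planted truth direction plus isotropic noise.
rng = np.random.default_rng(2)
n, dim = 400, 32
truth = rng.normal(size=dim)
truth /= np.linalg.norm(truth)
labels = rng.integers(0, 2, size=n)  # 1 = true statement, 0 = false statement
acts = 2.5 * (2 * labels - 1)[:, None] * truth + rng.normal(size=(n, dim))

train, test = slice(0, 300), slice(300, None)
w, b = train_linear_probe(acts[train], labels[train])
test_acc = float(np.mean(((acts[test] @ w + b) > 0) == labels[test]))
cos = float(w @ truth / np.linalg.norm(w))
print(f"held-out accuracy={test_acc:.3f}  cosine to planted direction={cos:.3f}")
```

The normalized weight vector `w / ||w||` is the probe's estimate of the truth direction. The held-out split checks that the probe generalizes rather than memorizing the training statements.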

Concepts (2)

concept
  • Truth Direction
    related_to
    A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements
  • A specific direction in an LLM's residual stream that encodes the truth or falsehood of factual statements

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.