concept
active
concept:truth-direction-in-llms · Truth direction in LLMs
A linear direction in LLM activations associated with truthfulness, identified by Burns et al. (2022) and Azaria & Mitchell (2023).
Neighborhood — ranked by edge-count
Thinkers (1)
thinker
- Collin Burns · introduces · Discovered truth directions in LLMs without supervision; cited for truth probe methodology.
Methods (1)
method
- Linear Probe · implements · Simple linear classifiers trained on model activations, used as the probing technique within the introduced method.
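As a rough illustration of what a linear probe on activations looks like, the sketch below trains a plain logistic-regression classifier on synthetic vectors and checks whether its weight vector lines up with a planted "truth direction". Everything here is an assumption made for illustration: the activations are Gaussian noise rather than real LLM hidden states, the helper names (`make_synthetic_activations`, `train_linear_probe`) are hypothetical, and the probe is supervised, whereas the method attributed to Burns above works without labels.

```python
import math
import random


def make_synthetic_activations(n, direction, rng, shift=2.0):
    """Hypothetical stand-in for LLM activations: unit Gaussian noise shifted
    along +direction for true statements (label 1), -direction for false (0)."""
    xs, ys = [], []
    for i in range(n):
        label = i % 2
        sign = 1.0 if label == 1 else -1.0
        xs.append([rng.gauss(0.0, 1.0) + sign * shift * d for d in direction])
        ys.append(label)
    return xs, ys


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    # Numerically stable form: never exponentiates a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(xs, ys, epochs=200, lr=0.1):
    """Logistic regression fitted by full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(dot(w, x) + b) - y  # gradient of log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, xs, ys):
    hits = sum((dot(w, x) + b > 0) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


rng = random.Random(0)
dim = 8
planted = [1.0 / math.sqrt(dim)] * dim  # assumed unit-norm "truth direction"
train_x, train_y = make_synthetic_activations(200, planted, rng)
test_x, test_y = make_synthetic_activations(200, planted, rng)

probe_w, probe_b = train_linear_probe(train_x, train_y)
test_acc = accuracy(probe_w, probe_b, test_x, test_y)
alignment = cosine(probe_w, planted)  # how closely the probe recovers the planted direction
```

With real models, `train_x` would be replaced by hidden states extracted at a chosen layer, and the learned weight vector `probe_w` is what would be read as the candidate truth direction.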
Concepts (2)
concept
- Truth Direction · related_to · A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements.
- Truth Direction in LLM Latent Space · related_to · A specific direction in an LLM's residual stream that encodes the truth or falsehood of factual statements.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Where inside the LLM should we look for an accurate truth direction that will generalize the most across tasks? · question · 0.876 · One of the three guiding research questions of the paper.
- Central empirical conclusion of the paper about the fundamental limits of truth directions.
- Prior finding that LLM refusal is mediated by a single latent direction, analogous to this paper's reflection direction.
- Central interpretive claim of the paper
- The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.
- The core phenomenon studied: the ability of LLMs to evaluate and revise their own reasoning.
- The case where two datasets (e.g., larger_than and smaller_than) separate along opposite directions in PCA, indicating a shared feature with opposite sign
- We hypothesize that LLMs represent correctness of arithmetic expressions differently from factual statements. · hypothesis · 0.768 · Core working hypothesis motivating the factual vs. arithmetic task split in the experimental design.
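The opposite-direction case listed above (two datasets such as larger_than and smaller_than separating along opposite directions in PCA) can be made concrete with a small sketch. It assumes synthetic Gaussian "activations" that share one planted axis with flipped sign between the two datasets. The helper names and the power-iteration PCA are illustrative choices, not taken from the source.

```python
import math
import random


def sample(n, axis, true_sign, rng, shift=2.0):
    """Synthetic points: true statements (label 1) sit at true_sign * shift
    along `axis`, false statements (label 0) on the opposite side."""
    pts, labels = [], []
    for i in range(n):
        label = i % 2
        side = true_sign if label == 1 else -true_sign
        pts.append([rng.gauss(0.0, 1.0) + side * shift * a for a in axis])
        labels.append(label)
    return pts, labels


def top_principal_component(points, iters=50):
    """First principal component via power iteration on the centered data."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - mean[j] for j in range(dim)] for p in points]
    v = [1.0] * dim
    for _ in range(iters):
        scores = [sum(c[j] * v[j] for j in range(dim)) for c in centered]
        v = [sum(s * c[j] for s, c in zip(scores, centered)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


def signed_separation(points, labels, direction):
    """Mean projection of true statements minus mean projection of false ones."""
    proj = [sum(a * b for a, b in zip(p, direction)) for p in points]
    true_proj = [s for s, y in zip(proj, labels) if y == 1]
    false_proj = [s for s, y in zip(proj, labels) if y == 0]
    return sum(true_proj) / len(true_proj) - sum(false_proj) / len(false_proj)


rng = random.Random(0)
dim = 6
shared_axis = [1.0] + [0.0] * (dim - 1)  # assumed shared feature axis

# Same axis, opposite sign: "true" lands on opposite sides in the two datasets.
larger_pts, larger_labels = sample(200, shared_axis, +1.0, rng)
smaller_pts, smaller_labels = sample(200, shared_axis, -1.0, rng)

pc1 = top_principal_component(larger_pts + smaller_pts)
sep_larger = signed_separation(larger_pts, larger_labels, pc1)
sep_smaller = signed_separation(smaller_pts, smaller_labels, pc1)
```

The sign of `pc1` itself is arbitrary, so the quantity to inspect is the product `sep_larger * sep_smaller`: a negative product means the two datasets use the same component with opposite sign.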