concept
active
concept:truth-direction-universality
Truth direction universality
The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.
Neighborhood — ranked by edge-count
Papers (1)
- Testing the Limits of Truth Directions in LLMs · contradicts · introduces
Findings (3)
- Shows the passive vs. active divide is more important than the specific wording of instructions.
- Demonstrates that early-layer probes capture sentence polarity rather than truth.
- Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
Concepts (1)
- Truth Direction · associated_with · related_to · A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Overarching conclusion summarizing the paper's contribution relative to prior universality claims.
- Argues against the single-layer analysis approach of prior work.
- Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023
- Identified as the exact computational operation that breaks truth direction generalization.
- Truth directions emerge in earlier layers for factual tasks and later layers for arithmetic tasks. · claim · 0.770 · Core empirical claim about the layer-dependence of truth direction emergence as a function of task type.
- Interpretation of KL divergence retention results
- The case where two datasets (e.g., larger_than and smaller_than) separate along opposite directions in PCA, indicating a shared feature with opposite sign
- A direction that classifies truth irrespective of sentence polarity, emerging and dominating in middle-to-late layers.
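The opposite-sign case listed above (two datasets such as larger_than and smaller_than separating along opposite directions) can be illustrated with a small sketch on synthetic data. Everything here is an assumption made for illustration: the activations are random vectors rather than real LLM hidden states, the difference-of-means probe stands in for whatever probing method the paper uses, and the helper names (`mean_diff_direction`, `make_dataset`, `probe_accuracy`) are hypothetical.

```python
import numpy as np

def mean_diff_direction(acts, labels):
    """Unit vector pointing from the mean false activation to the mean true activation."""
    d = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return d / np.linalg.norm(d)

def make_dataset(rng, feature, sign, n=200, noise=0.5):
    """Synthetic activations (assumption): true statements are shifted by
    sign * feature and false statements by -sign * feature, plus Gaussian noise."""
    labels = rng.integers(0, 2, size=n)
    shift = np.where(labels == 1, sign, -sign)[:, None] * feature
    acts = shift + noise * rng.standard_normal((n, feature.shape[0]))
    return acts, labels

def probe_accuracy(direction, acts, labels):
    """Classify as true when the centered activation projects positively onto the direction."""
    centered = acts - acts.mean(axis=0)
    preds = (centered @ direction > 0).astype(int)
    return float((preds == labels).mean())

rng = np.random.default_rng(0)
dim = 16
feature = np.zeros(dim)
feature[0] = 3.0  # one shared feature axis used by both datasets

# Dataset A encodes "true" with a positive sign, dataset B with a negative sign.
acts_a, labels_a = make_dataset(rng, feature, sign=+1)
acts_b, labels_b = make_dataset(rng, feature, sign=-1)

dir_a = mean_diff_direction(acts_a, labels_a)
dir_b = mean_diff_direction(acts_b, labels_b)

cosine = float(dir_a @ dir_b)                      # close to -1: same axis, opposite sign
acc_in = probe_accuracy(dir_a, acts_a, labels_a)   # probe evaluated on its own dataset
acc_out = probe_accuracy(dir_a, acts_b, labels_b)  # same probe transferred to the other dataset

print(f"cosine={cosine:.3f} in-dataset acc={acc_in:.2f} transfer acc={acc_out:.2f}")
```

In this toy setup the transferred probe scores far below chance rather than at chance, which is the signature of a shared feature with flipped sign as opposed to two unrelated features. Whether real model activations behave this cleanly is an empirical question this sketch does not address.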