concept
active
concept:antipodal-alignment-of-truth-directions
Antipodal Alignment of Truth Directions
The case where two datasets (e.g., larger_than and smaller_than) separate true from false statements along opposite directions in PCA space, indicating a shared feature encoded with opposite sign.
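The definition above can be illustrated with a minimal sketch on synthetic data. Nothing here comes from the source's experiments: the activations are random vectors, the helper names `truth_direction` and `make_synthetic` are hypothetical, and the 16-dimensional space, noise level, and sample size are arbitrary assumptions. Each dataset's truth direction is estimated as the difference of class means, and antipodal alignment shows up as a cosine similarity near -1 and as opposite-signed projections onto the first principal component of the pooled data.

```python
import numpy as np


def truth_direction(acts, labels):
    """Unit vector pointing from the false-class mean to the true-class mean."""
    diff = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return diff / np.linalg.norm(diff)


def make_synthetic(feature, sign, n=200, noise=0.1, seed=0):
    """Synthetic 'activations' (an assumption, not real model data).

    True statements are shifted by +sign * feature, false ones by -sign * feature.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    offsets = np.where(labels[:, None] == 1, sign, -sign) * feature
    acts = offsets + noise * rng.standard_normal((n, feature.size))
    return acts, labels


# One shared feature direction, used with opposite sign by the two datasets.
rng = np.random.default_rng(42)
feature = rng.standard_normal(16)
feature /= np.linalg.norm(feature)

acts_a, labels_a = make_synthetic(feature, +1.0, seed=1)  # stand-in for larger_than
acts_b, labels_b = make_synthetic(feature, -1.0, seed=2)  # stand-in for smaller_than

dir_a = truth_direction(acts_a, labels_a)
dir_b = truth_direction(acts_b, labels_b)
cosine = float(dir_a @ dir_b)  # close to -1 when the directions are antipodal

# First principal component of the pooled, centered activations (via SVD).
pooled = np.vstack([acts_a, acts_b])
pooled = pooled - pooled.mean(axis=0)
_, _, vt = np.linalg.svd(pooled, full_matrices=False)
pc1 = vt[0]

# The two truth directions project onto PC1 with opposite signs.
proj_a = float(dir_a @ pc1)
proj_b = float(dir_b @ pc1)
print(f"cosine(dir_a, dir_b) = {cosine:.3f}")
print(f"projection on PC1: a = {proj_a:.3f}, b = {proj_b:.3f}")
```

With real data, `acts_a` and `acts_b` would be replaced by activations extracted from a model for the two datasets. The overall sign of `pc1` is arbitrary, so only the relative sign of the two projections is meaningful.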
Neighborhood — ranked by edge-count
Claims (1)
claim
- Interpretive claim connecting scale to abstraction level in LLM representations
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements
- Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023
- The goal of making model behavior match human values and intentions, often addressed during post-training.
- Scale-dependent structural finding from PCA visualizations in §4
- The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.
- Field within which this work has implications for evaluating alignment progress.
- A direction that classifies truth irrespective of sentence polarity, emerging and dominating in middle-to-late layers.
- A specific direction in an LLM's residual stream that encodes the truth or falsehood of factual statements