concept
active
concept:polarity-invariant-truth-direction-tg
Polarity-invariant truth direction (tG)
A direction that classifies truth irrespective of sentence polarity, emerging and dominating in middle-to-late layers.
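One way to make the definition concrete is a small synthetic sketch. It assumes activations can be modeled as a mean plus a truth term along tG and a truth-times-polarity term along tP, and recovers both directions by ordinary least squares. The data, dimensions, noise level, and all variable names below are hypothetical; this illustrates the idea and is not the implementation used in the cited framework.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 16  # hypothetical: number of statements, activation dimension

# Hypothetical labels: tau = +1 (true) / -1 (false),
# pol = +1 (affirmative) / -1 (negated).
tau = rng.choice([-1.0, 1.0], size=n)
pol = rng.choice([-1.0, 1.0], size=n)

# Synthetic ground-truth directions, made orthogonal via QR for a clean demo.
q, _ = np.linalg.qr(rng.normal(size=(d, 2)))
t_g_true, t_p_true = 3.0 * q[:, 0], 3.0 * q[:, 1]
mu = rng.normal(size=d)

# Assumed generative model: a = mu + tau * tG + tau * pol * tP + noise.
acts = (mu
        + np.outer(tau, t_g_true)
        + np.outer(tau * pol, t_p_true)
        + 0.1 * rng.normal(size=(n, d)))

# Least-squares fit of the same model; rows of coef are mu, tG, tP estimates.
design = np.column_stack([np.ones(n), tau, tau * pol])
coef, *_ = np.linalg.lstsq(design, acts, rcond=None)
mu_hat, t_g_hat, t_p_hat = coef


def accuracy(direction, mask):
    """Fraction of statements in `mask` whose projection sign matches tau."""
    scores = (acts[mask] - mu_hat) @ direction
    return float(np.mean(np.sign(scores) == tau[mask]))


aff, neg = pol > 0, pol < 0
acc_g_aff, acc_g_neg = accuracy(t_g_hat, aff), accuracy(t_g_hat, neg)
acc_p_aff, acc_p_neg = accuracy(t_p_hat, aff), accuracy(t_p_hat, neg)

print("tG accuracy (affirmative, negated):", acc_g_aff, acc_g_neg)
print("tP accuracy (affirmative, negated):", acc_p_aff, acc_p_neg)
```

In this toy setting, projecting onto the estimated tG separates true from false statements for both affirmative and negated sentences, while projecting onto the estimated tP works for affirmative sentences and flips sign for negated ones. Real activations would need to be extracted per layer from a model, and nothing here says which layers either direction dominates in.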
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Burger et al. (2024) framework proposing that truth is linearly decodable from a 2D subspace spanned by a polarity-dependent direction and a polarity-invariant direction.
Concepts (1)
concept
- Polarity-dependent truth direction (tP) · associated_with, related_to · A direction that classifies affirmative statements effectively but inverts for negated variants, dominating in early layers.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Variance decomposition showing the disentanglement of polarity from truth across model depth.
- A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements
- The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.
- The case where two datasets (e.g., larger_than and smaller_than) separate along opposite directions in PCA, indicating a shared feature with opposite sign
- Motivating hypothesis for Section 5's investigation of prompt template effects.
- Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023
- A specific direction in an LLM's residual stream that encodes the truth or falsehood of factual statements
- Argues against the single-layer analysis approach of prior work.