concept
active
concept:polarity-dependent-truth-direction-tp
Polarity-dependent truth direction (tP)
A truth direction that separates true from false affirmative statements effectively, but whose sign inverts on their negated variants; it dominates in early layers.
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Burger et al. (2024) framework proposing that truth is linearly decoded along a 2D subspace capturing both polarity-dependent and polarity-invariant directions.
Concepts (2)
concept
- Polarity-invariant truth direction (tG) · associated_with, related_to · A direction that classifies truth irrespective of sentence polarity, emerging and dominating in middle-to-late layers.
- Sentence polarity · associated_with · Whether a statement is affirmative or negated; a surface feature that confounds early-layer truth probes.
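The relationship between tP, tG, and sentence polarity described above can be illustrated with a toy simulation. This is a minimal sketch, not the framework's actual procedure: it assumes activations combine the two directions additively, with the tP component flipping sign under negation, and every name and number in it (`t_g`, `t_p`, `w_g`, `w_p`, the dimension, the noise level) is hypothetical. It shows why a probe fitted only on affirmative statements can invert on negated ones when tP dominates, and why it transfers when tG dominates.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical activation dimension

# Hypothetical orthonormal directions standing in for tG and tP.
t_g = np.zeros(d)
t_g[0] = 1.0
t_p = np.zeros(d)
t_p[1] = 1.0


def make_activations(n, polarity, w_g, w_p, noise=0.1):
    """Toy activations (assumed form): tau*w_g*tG + tau*polarity*w_p*tP + noise."""
    tau = rng.choice([-1.0, 1.0], size=n)  # +1 = true, -1 = false
    acts = (
        tau[:, None] * w_g * t_g
        + (tau * polarity)[:, None] * w_p * t_p
        + noise * rng.standard_normal((n, d))
    )
    return acts, tau


def fit_direction(acts, tau):
    """Difference-of-means probe: mean of true minus mean of false activations."""
    return acts[tau > 0].mean(axis=0) - acts[tau < 0].mean(axis=0)


def accuracy(direction, acts, tau):
    """Fraction of statements whose projection sign matches the truth label."""
    return float(np.mean(np.sign(acts @ direction) == tau))


def probe_transfer(w_g, w_p, n=500):
    """Fit on affirmative statements, then score on affirmative and negated ones."""
    train_x, train_y = make_activations(n, +1, w_g, w_p)
    aff_x, aff_y = make_activations(n, +1, w_g, w_p)
    neg_x, neg_y = make_activations(n, -1, w_g, w_p)
    direction = fit_direction(train_x, train_y)
    return accuracy(direction, aff_x, aff_y), accuracy(direction, neg_x, neg_y)


early = probe_transfer(w_g=0.2, w_p=1.0)  # tP-dominated regime (early-layer-like)
late = probe_transfer(w_g=1.0, w_p=0.2)   # tG-dominated regime (late-layer-like)
print("tP-dominated (affirmative, negated):", early)
print("tG-dominated (affirmative, negated):", late)
```

In the tP-dominated setting the probe scores well on affirmative statements and far below chance on negated ones, which is the inversion pattern; in the tG-dominated setting it scores well on both. The weights are chosen only to make the two regimes visible and say nothing about the magnitudes in any real model.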
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Variance decomposition showing the disentanglement of polarity from truth across model depth.
- The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.
- A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements.
- Argues against the single-layer analysis approach of prior work.
- Interpretation of the finding that early-layer F0-trained probes invert on F1 (negated statements).
- A specific direction in an LLM's residual stream that encodes the truth or falsehood of factual statements.
- Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023.
- Central empirical conclusion of the paper about the fundamental limits of truth directions.