Polarity-dependent truth direction (tP)

A direction in activation space that separates true from false affirmative statements but flips sign on their negated variants; it dominates in early layers.
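
The inversion described above can be illustrated on synthetic data. The following is a minimal sketch, not the method of any cited paper: it assumes truth and polarity are coded as ±1, plants two hypothetical orthogonal directions (`t_g`, `t_p`), makes the polarity-dependent component stronger to mimic an early layer, and uses a simple mean-difference probe. All names, scales, and dimensions are illustrative assumptions.

```python
import numpy as np

def make_activations(n, dim, t_g, t_p, g_scale, p_scale, rng):
    """Synthetic activations; truth (tau) and polarity (pol) are in {-1, +1}."""
    tau = rng.choice([-1.0, 1.0], size=n)   # +1 = true, -1 = false
    pol = rng.choice([-1.0, 1.0], size=n)   # +1 = affirmative, -1 = negated
    noise = 0.1 * rng.standard_normal((n, dim))
    acts = (
        (g_scale * tau)[:, None] * t_g           # polarity-invariant component
        + (p_scale * tau * pol)[:, None] * t_p   # polarity-dependent component
        + noise
    )
    return acts, tau, pol

def mean_difference_probe(acts, tau):
    """Direction pointing from the false-class mean to the true-class mean."""
    return acts[tau > 0].mean(axis=0) - acts[tau < 0].mean(axis=0)

def accuracy(acts, tau, direction):
    """Fraction of statements whose projection sign matches the truth label."""
    return float(np.mean(np.sign(acts @ direction) == tau))

rng = np.random.default_rng(0)
dim = 16
t_g = np.eye(dim)[0]  # hypothetical polarity-invariant direction
t_p = np.eye(dim)[1]  # hypothetical polarity-dependent direction

# "Early layer" assumption: the polarity-dependent component dominates.
acts, tau, pol = make_activations(2000, dim, t_g, t_p, g_scale=0.3, p_scale=1.0, rng=rng)
aff, neg = pol > 0, pol < 0

# Fit the probe on affirmative statements only, then evaluate on both polarities.
probe = mean_difference_probe(acts[aff], tau[aff])
acc_aff = accuracy(acts[aff], tau[aff], probe)
acc_neg = accuracy(acts[neg], tau[neg], probe)
print(f"affirmative accuracy: {acc_aff:.2f}, negated accuracy: {acc_neg:.2f}")
```

Under these assumptions the probe is nearly perfect on affirmative statements and nearly always wrong on negated ones, because the probe mostly aligns with the polarity-dependent component, whose sign flips under negation.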

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Two-dimensional truth subspace
uses
Framework from Burger et al. (2024) proposing that truth can be linearly decoded from a 2D subspace spanned by a polarity-dependent direction and a polarity-invariant direction.
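
One way to make the two-direction picture concrete is an ordinary least-squares fit. The sketch below is an illustration under stated assumptions, not necessarily the fitting procedure used by Burger et al. (2024): activations are synthetic, truth (`tau`) and polarity (`pol`) are coded as ±1, and the planted directions `true_t_g` and `true_t_p` are hypothetical. Activations are regressed on the two regressors `tau` and `tau * pol`, and the fitted coefficient rows are read as estimates of tG and tP.

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim = 2000, 16
true_t_g = np.eye(dim)[0]  # planted polarity-invariant direction (assumption)
true_t_p = np.eye(dim)[1]  # planted polarity-dependent direction (assumption)

tau = rng.choice([-1.0, 1.0], size=n)  # +1 = true, -1 = false
pol = rng.choice([-1.0, 1.0], size=n)  # +1 = affirmative, -1 = negated
acts = (
    np.outer(tau, true_t_g)
    + np.outer(tau * pol, true_t_p)
    + 0.1 * rng.standard_normal((n, dim))
)

# Center the activations, then regress them on [tau, tau * pol].
centered = acts - acts.mean(axis=0)
design = np.column_stack([tau, tau * pol])
coef, *_ = np.linalg.lstsq(design, centered, rcond=None)
t_g_hat, t_p_hat = coef[0], coef[1]  # each row has shape (dim,)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

cos_g = cosine(t_g_hat, true_t_g)
cos_p = cosine(t_p_hat, true_t_p)

# Projecting onto the estimated polarity-invariant direction should
# classify truth for both affirmative and negated statements.
scores = centered @ t_g_hat
acc_aff = float(np.mean(np.sign(scores[pol > 0]) == tau[pol > 0]))
acc_neg = float(np.mean(np.sign(scores[pol < 0]) == tau[pol < 0]))
print(f"cos(tG): {cos_g:.3f}, cos(tP): {cos_p:.3f}")
print(f"tG accuracy, affirmative: {acc_aff:.2f}, negated: {acc_neg:.2f}")
```

In this toy setting the two regressors are nearly uncorrelated, so the fit separates the directions cleanly. With real activations the labels, centering scheme, and layer choice would all matter and are left open here.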

Concepts (2)

concept

Polarity-invariant truth direction (tG)
associated_with, related_to
A direction that classifies truth irrespective of sentence polarity, emerging and dominating in middle-to-late layers.
Sentence polarity
associated_with
Whether a statement is affirmative or negated; a surface feature that confounds early-layer truth probes.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

In early layers, the polarity-dependent direction tP explains ~0.38 of truth-related variance at layer 7, versus ~0.09 for tG; by the middle layers tG takes over and tP decays. · finding · 0.833
Variance decomposition showing the disentanglement of polarity from truth across model depth.
Truth direction universality · concept · 0.758
The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.
Truth Direction · concept · 0.756
A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements.
No single layer is universally optimal for probing truth directions; different tasks peak at different layers. · claim · 0.727
Argues against the single-layer analysis approach of prior work.
Early-layer truth probes primarily capture sentence polarity rather than truth. · claim · 0.724
Interpretation of the finding that early-layer F0-trained probes invert on F1 (negated statements).
Truth Direction in LLM Latent Space · concept · 0.720
A specific direction in an LLM's residual stream that encodes the truth or falsehood of factual statements.
Truth direction in LLMs · concept · 0.718
Linear direction in LLM activations associated with truthfulness, identified by Burns et al. (2022) and Azaria & Mitchell (2023).
Linear truth directions in LLMs are reliable primarily in factual recall cases and break down when truth assessment depends on computing and storing intermediate results. · claim · 0.718
Central empirical conclusion of the paper about the fundamental limits of truth directions.