Polarity-invariant truth direction (tG)

A direction that classifies truth irrespective of sentence polarity, emerging and dominating in middle-to-late layers.

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Two-dimensional truth subspace
uses
Burger et al. (2024) framework proposing that truth is linearly decoded along a 2D subspace capturing both polarity-dependent and polarity-invariant directions.

Concepts (1)

concept

Polarity-dependent truth direction (tP)
associated_with · related_to
A direction that classifies affirmative statements effectively but inverts for negated variants, dominating in early layers.
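The relationship between tG and tP can be illustrated with a small least-squares sketch on synthetic data. It assumes activations are roughly modeled as truth * tG + truth * polarity * tP, with truth and polarity coded as +1/-1 (affirmative = +1). That parameterization, the function name fit_truth_directions, and the synthetic data are illustrative assumptions, not details taken from this entry.

```python
import numpy as np

def fit_truth_directions(acts, truth, polarity):
    """Least-squares fit of acts ~ truth * t_g + (truth * polarity) * t_p.

    acts:     (n, d) array of mean-centered activations (assumed input)
    truth:    (n,) array of +1 (true) / -1 (false)
    polarity: (n,) array of +1 (affirmative) / -1 (negated)
    """
    # Two regressors per statement: truth alone, and truth flipped by polarity.
    design = np.stack([truth, truth * polarity], axis=1).astype(float)
    coef, *_ = np.linalg.lstsq(design, acts, rcond=None)  # shape (2, d)
    t_g, t_p = coef
    return t_g, t_p

# Synthetic activations built from two known, orthogonal directions.
rng = np.random.default_rng(0)
n, d = 400, 16
true_tg = rng.normal(size=d)
true_tp = rng.normal(size=d)
true_tp -= (true_tp @ true_tg) / (true_tg @ true_tg) * true_tg  # orthogonalize

truth = rng.choice([-1, 1], size=n)
polarity = rng.choice([-1, 1], size=n)
acts = (truth[:, None] * true_tg
        + (truth * polarity)[:, None] * true_tp
        + 0.1 * rng.normal(size=(n, d)))

t_g, t_p = fit_truth_directions(acts, truth, polarity)

# Projecting onto t_g should classify truth regardless of polarity.
acc_g = np.mean(np.sign(acts @ t_g) == truth)

# Projecting onto t_p should work for affirmative statements and invert for negated ones.
affirm, negated = polarity == 1, polarity == -1
acc_p_affirm = np.mean(np.sign(acts[affirm] @ t_p) == truth[affirm])
acc_p_negated = np.mean(np.sign(acts[negated] @ t_p) == truth[negated])

print(acc_g, acc_p_affirm, acc_p_negated)
```

On this toy data the tG projection should separate true from false for both polarities, while the tP projection should be accurate on affirmative statements and close to fully inverted on negated ones, which is the behavior the two definitions above describe.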

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

In early layers, the polarity-dependent direction tP explains ~0.38 of truth-related variance at layer 7 vs ~0.09 for tG; by middle layers tG takes over and tP decays. · finding · 0.786
Variance decomposition showing the disentanglement of polarity from truth across model depth.
Truth Direction · concept · 0.771
A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements.
Truth direction universality · concept · 0.762
The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.
Antipodal Alignment of Truth Directions · concept · 0.756
The case where two datasets (e.g., larger_than and smaller_than) separate along opposite directions in PCA, indicating a shared feature with opposite sign.
We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions. · hypothesis · 0.725
Motivating hypothesis for Section 5's investigation of prompt template effects.
Truth direction in LLMs · concept · 0.725
Linear direction in LLM activations associated with truthfulness, identified by Burns et al. (2022) and Azaria & Mitchell (2023).
Truth Direction in LLM Latent Space · concept · 0.723
A specific direction in an LLM's residual stream that encodes the truth or falsehood of factual statements.
No single layer is universally optimal for probing truth directions; different tasks peak at different layers. · claim · 0.719
Argues against the single-layer analysis approach of prior work.