Truth direction universality

The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.

Neighborhood — ranked by edge-count

Papers (1)

paper

Testing the Limits of Truth Directions in LLMs
contradicts · introduces

Findings (3)

finding

Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity.
contradicts
Shows the passive vs. active divide is more important than the specific wording of instructions.
F0-trained probes in layers 4-10 show inverted separation on F1 (AUROC ≈ 0), systematically misclassifying true statements as false.
contradicts
Demonstrates that early-layer probes capture sentence polarity rather than truth.
Truth probes fail to generalize to harder factual tasks F3-F5 regardless of prompt template, with AUROC near or below 0.6.
contradicts
Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
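
The inverted-separation finding above (AUROC ≈ 0 rather than ≈ 0.5) can be illustrated with a minimal sketch. It assumes purely synthetic activations, not the paper's data or models: the direction `w`, the `sample` helper, and the two toy datasets are hypothetical stand-ins. It shows that a probe direction which separates one dataset well scores near 0 on a dataset where the same feature carries the opposite sign, and that flipping the direction recovers the complementary AUROC.

```python
import random

def auroc(scores, labels):
    """Probability that a random true item outscores a random false one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cosine(u, v):
    """Cosine similarity between two direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def project(x, w):
    """Probe score: projection of an activation onto the probe direction."""
    return sum(a * b for a, b in zip(x, w))

rng = random.Random(0)
dim = 8
# Hypothetical probe direction (assumption: it lies along the first axis).
w = [1.0] + [0.0] * (dim - 1)

def sample(n, sign):
    """Synthetic activations: true items sit at sign*2 along w, false at -sign*2, plus unit noise."""
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        mean = sign * (2.0 if y == 1 else -2.0)
        xs.append([rng.gauss(mean if i == 0 else 0.0, 1.0) for i in range(dim)])
        ys.append(y)
    return xs, ys

xs_a, ys_a = sample(200, +1)  # toy dataset where w points toward "true"
xs_b, ys_b = sample(200, -1)  # toy dataset where the same feature has the opposite sign

auroc_a = auroc([project(x, w) for x in xs_a], ys_a)
auroc_b = auroc([project(x, w) for x in xs_b], ys_b)

# Flipping the direction gives cosine -1 and turns AUROC a into 1 - a.
w_flipped = [-a for a in w]
auroc_b_flipped = auroc([project(x, w_flipped) for x in xs_b], ys_b)

print(f"AUROC on A: {auroc_a:.3f}, on B: {auroc_b:.3f}, on B after flip: {auroc_b_flipped:.3f}")
print(f"cosine(w, w_flipped) = {cosine(w, w_flipped):.1f}")
```

An AUROC near 0 is therefore not noise: the direction still carries information, but with the wrong sign for the target dataset, which is consistent with the reading that such probes track a feature other than truth.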

Concepts (1)

concept

Truth Direction
associated_with · related_to
A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Universality claims for truth directions are more limited than previously assumed, with significant differences observable for various model layers, task difficulties, task types, and prompt templates.
claim · 0.826
Overarching conclusion summarizing the paper's contribution relative to prior universality claims.
No single layer is universally optimal for probing truth directions; different tasks peak at different layers.
claim · 0.807
Argues against the single-layer analysis approach of prior work.
Truth direction in LLMs
concept · 0.794
Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023.
The need for genuine counting over lists of more than two elements introduces the key limitation of truth directions.
claim · 0.773
Identified as the exact computational operation that breaks truth direction generalization.
Truth directions emerge in earlier layers for factual tasks and later layers for arithmetic tasks.
claim · 0.770
Core empirical claim about the layer-dependence of truth direction emergence as a function of task type.
Discovered truth directions are highly specific and do not interfere with general instruction-following behavior.
claim · 0.764
Interpretation of KL divergence retention results.
Antipodal Alignment of Truth Directions
concept · 0.762
The case where two datasets (e.g., larger_than and smaller_than) separate along opposite directions in PCA, indicating a shared feature with opposite sign.
Polarity-invariant truth direction (tG)
concept · 0.762
A direction that classifies truth irrespective of sentence polarity, emerging and dominating in middle-to-late layers.