Antipodal Alignment of Truth Directions

The case in which two datasets (e.g., larger_than and smaller_than) separate true from false statements along opposite directions in PCA space, indicating a shared feature encoded with opposite sign.
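A minimal sketch of what this looks like numerically, using synthetic vectors rather than real LLM activations. Everything here is an assumption for illustration: the dimension, the noise level, the hidden `feature` vector, the helper names, and the use of a difference-of-means direction alongside a power-iteration PCA. The two sample groups only stand in for larger_than and smaller_than; this is not the procedure used in the source.

```python
import math
import random

def mean(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def top_pc(rows, rng, iters=50):
    """First principal component via power iteration on centered data."""
    mu = mean(rows)
    centered = [sub(r, mu) for r in rows]
    v = [rng.gauss(0, 1) for _ in mu]
    for _ in range(iters):
        # Apply X^T X to v without forming the covariance matrix.
        scores = [dot(r, v) for r in centered]
        w = [sum(s * r[j] for s, r in zip(scores, centered))
             for j in range(len(mu))]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v, mu

rng = random.Random(0)
DIM = 8  # hypothetical activation dimension

# Hypothetical shared feature direction (unit norm).
feature = [rng.gauss(0, 1) for _ in range(DIM)]
scale = math.sqrt(dot(feature, feature))
feature = [x / scale for x in feature]

def sample(sign, n=100, noise=0.3):
    """Synthetic 'activations' placed at sign * 2 * feature plus noise."""
    return [[sign * 2.0 * f + rng.gauss(0, noise) for f in feature]
            for _ in range(n)]

# Stand-in for larger_than: true statements sit on the + side.
a_true, a_false = sample(+1), sample(-1)
# Stand-in for smaller_than: true statements sit on the - side.
b_true, b_false = sample(-1), sample(+1)

# Per-dataset truth direction: mean(true) - mean(false).
dir_a = sub(mean(a_true), mean(a_false))
dir_b = sub(mean(b_true), mean(b_false))
alignment = cosine(dir_a, dir_b)  # close to -1 when antipodal

# Pooled PCA: both datasets share PC1, but truth flips sign along it.
pc1, mu = top_pc(a_true + a_false + b_true + b_false, rng)

def mean_projection(rows):
    return sum(dot(sub(r, mu), pc1) for r in rows) / len(rows)

print("cosine between truth directions:", round(alignment, 3))
print("mean PC1 score, true (A):", round(mean_projection(a_true), 3))
print("mean PC1 score, true (B):", round(mean_projection(b_true), 3))
```

A cosine near -1 between the two per-dataset directions, together with opposite-signed PC1 scores for the "true" groups, is the signature described above: one axis carries the feature for both datasets, but "true" lands on opposite ends of it.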

Neighborhood — ranked by edge-count

Claims (1)

claim

As LLMs scale, they develop increasingly general abstractions, with large models linearly representing abstract concepts like truth that capture shared properties of diverse inputs
associated_with
Interpretive claim connecting scale to abstraction level in LLM representations

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Truth Direction · concept · 0.845
A hypothesized direction in LLM activation space that encodes the truth or falsehood of factual statements
Truth direction in LLMs · concept · 0.786
Linear direction in LLM activations associated with truthfulness, identified by Burns et al. 2022 and Azaria & Mitchell 2023
Alignment · concept · 0.781
The goal of making model behavior match human values and intentions, often addressed during post-training.
Antipodal alignment between related datasets (e.g., larger_than and smaller_than) in smaller models resolves to common-direction alignment in larger models · claim · 0.771
Scale-dependent structural finding from PCA visualizations in §4
Truth direction universality · concept · 0.762
The claim that truth directions are consistent and generalizable across layers, tasks, and prompt formats in LLMs.
AI alignment · concept · 0.759
Field within which this work has implications for evaluating alignment progress.
Polarity-invariant truth direction (tG) · concept · 0.756
A direction that classifies truth irrespective of sentence polarity, emerging and dominating in middle-to-late layers.
Truth Direction in LLM Latent Space · concept · 0.751
A specific direction in an LLM's residual stream that encodes the truth or falsehood of factual statements