2026
paper:doi-10-48550-arxiv-2604-03754

Testing the Limits of Truth Directions in LLMs

TL;DR

Linear truth directions in LLMs are reliable primarily for simple factual retrieval and break down as soon as truth assessment requires tracking intermediate results—a finding that sharply constrains universality claims made by Marks & Tegmark (2024) and Bao et al. (2025). Probing Llama-3.1-8B-Instruct (32 layers, d_model=4096) and three additional models from the Llama and Gemma families across a controlled 9-task hierarchy—six factual tasks (F0–F5) and three arithmetic tasks (A1–A3)—reveals that factual truth directions emerge in early to mid layers while arithmetic truth directions emerge continuously through late layers, and that no single layer is universally optimal. The paper introduces a layer-by-layer cross-task generalization evaluation using AUROC, showing that F0-trained probes achieve near-perfect in-domain performance from layer 8 yet exhibit inverted separation (AUROC ≈ 0) on negated variants (F1) at those same layers, with the polarity-dependent direction tp dominating at layer 7 (~0.38 variance explained vs. ~0.09 for the polarity-invariant direction tG). Generalization collapses to near chance as soon as counting is required over lists of length 3, while an F3-trained probe reaches only AUROC ≈ 0.6 on F4; similarly, A1-trained probes degrade significantly on A2 and A2-trained probes achieve only ~0.65 on A3. Switching from a passive no-prompt template to an explicit ask-correct template shifts truth-direction geometry so dramatically that no-prompt probes fail to transfer to ask-correct activations, yet the ask-correct setting enables arithmetic-trained probes to generalize almost perfectly to simple factual tasks F0–F2. The paper argues that universality claims for truth directions are fundamentally bounded by the computational demand of truth assessment, and that conclusions drawn from single-layer, no-instruction, factual-only analyses should not be assumed to extend to settings involving multi-step reasoning or varied prompt formats.
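To make the AUROC figures above concrete, here is a minimal sketch of the metric in plain Python. It illustrates why AUROC ≈ 0 means inverted separation rather than absence of signal: the probe ranks the classes perfectly but with the wrong sign. The scores and labels are hypothetical values chosen for illustration, not data from the paper.

```python
def auroc(scores, labels):
    """Probability that a random true example outscores a random false one.

    Ties count as half a win, so constant scores give exactly 0.5 (chance).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical probe scores: classes are perfectly separated, but every
# true statement (label 1) scores lower than every false one (label 0).
labels = [1, 1, 1, 0, 0, 0]
scores = [-2.0, -1.5, -0.5, 0.4, 1.1, 2.3]

inverted = auroc(scores, labels)                # wrong-signed probe
flipped = auroc([-s for s in scores], labels)   # same probe, sign reversed
chance = auroc([0.0] * 6, labels)               # uninformative probe
```

Here `inverted` is 0.0, `flipped` is 1.0, and `chance` is 0.5. A probe with AUROC near 0 therefore still carries the information needed to separate the classes. What fails to transfer is the sign of the direction.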

What to take away

  1. Factual truth directions in Llama-3.1-8B-Instruct emerge reliably in early to mid layers (peaking by layer 8 for F0–F3), while arithmetic truth directions (A1–A3) emerge gradually and only reach peak performance in late layers, with the exact transition layer varying by task.
  2. F0-trained probes achieve near-perfect in-domain AUROC from layer 8 but exhibit AUROC ≈ 0 (inverted separation) on the negated task F1 at layers 4–10, meaning they systematically misclassify true negated statements as false.
  3. At layer 7, the polarity-dependent direction tp explains ~0.38 of truth-related variance versus ~0.09 for the polarity-invariant direction tG, indicating that early-layer probes capture sentence polarity rather than truth; by mid layers tG overtakes tp.
  4. The two-dimensional truth subspace reported by Bürger et al. (2024) at layer 12 reflects a transitional phase (at that layer tp and tG explain similar variance fractions, ~0.33 each) rather than a universal property of truth representations.
  5. Generalization collapses to near chance as soon as counting is required over lists of 3 cities, with an F3-trained probe reaching only AUROC ≈ 0.6 on the F4 task (5-city lists), identifying the minimum counting operation as the boundary for truth-direction generalization.
  6. For arithmetic tasks, an A2-trained probe achieves only ~0.65 AUROC on A3 (three-operation expressions), demonstrating that each additional operation requiring intermediate result storage degrades generalization independently of the source-task probe complexity.
  7. Switching from a passive no-prompt template to the ask-correct template ("Is the following correct? {statement} Answer:") causes no-prompt probes to fail on ask-correct activations, with cosine similarity between the two sets of directional probes remaining near zero across all layers and tasks.
  8. Under the ask-correct prompt, arithmetic-trained probes (A1–A3) generalize almost perfectly to simple factual tasks F0–F2 (AUROC ≈ 1.00 in the generalization heatmap at layer 25), an effect absent under no-prompt, showing that explicit evaluation framing can partially unify truth directions across task families.
  9. A methodology replicable by other researchers: bias-free logistic probes are trained on mean-centered residual-stream activations at the final token position across all 32 layers using Adam (lr=1e-3, weight decay=0.1, 1000 steps), with 70/30 train/test splits on balanced datasets of up to 2,000 examples per task, and evaluated via AUROC for both in-domain and cross-task transfer.
  10. An open question the paper raises: whether the degraded generalization of truth probes on benchmarks like MMLU (previously attributed to domain diversity or question ambiguity by Bao et al. (2025)) is primarily explained by the computational demand of multi-step reasoning, and whether methods for input-truth and output-truth directions can be jointly leveraged to build reliable truth assessment tools robust to task difficulty.
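The probing recipe in takeaway 9 can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The activations are synthetic Gaussian stand-ins for residual-stream vectors, the dimensions are toy-sized, centering with the training-split mean is an assumption, and the weight decay is implemented as a coupled L2 term added to the gradient (the summary does not say whether the paper uses coupled or decoupled decay).

```python
import numpy as np


def train_probe(X, y, steps=1000, lr=1e-3, weight_decay=0.1):
    """Bias-free logistic probe trained with full-batch Adam.

    Assumption: weight decay is a coupled L2 penalty added to the gradient.
    """
    w = np.zeros(X.shape[1])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    b1, b2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))          # no bias term
        grad = X.T @ (p - y) / len(y) + weight_decay * w
        m = b1 * m + (1 - b1) * grad
        v = b2 * v + (1 - b2) * grad ** 2
        w -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return w


def auroc(scores, labels):
    """Rank-based AUROC; ties count as half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


# Synthetic stand-in for one layer's final-token activations (hypothetical).
rng = np.random.default_rng(0)
n, d = 400, 16                                   # toy sizes for illustration
y = rng.permutation(np.tile([0.0, 1.0], n // 2)) # balanced labels
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
acts = rng.normal(size=(n, d)) + 2.0 * (2 * y - 1)[:, None] * direction

# 70/30 train/test split, then mean-center with the training mean (assumption).
n_train = (7 * n) // 10
X_train, X_test = acts[:n_train], acts[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
mu = X_train.mean(axis=0)

w = train_probe(X_train - mu, y_train)
test_auroc = auroc((X_test - mu) @ w, y_test)
```

In a real replication this loop would run once per layer and per task, with `acts` replaced by activations extracted from the model. Because AUROC depends only on the ranking of scores, the small step size and strong decay affect the probe's norm far more than its evaluation.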

Peer brief — for seminar discussion

Poulis, Crovella, and Terzi systematically probe the geometry of linear truth directions across all layers of four instruction-tuned LLMs (Llama-3.1-8B-Instruct with 32 layers, Llama-3.2-3B-Instruct, Gemma-2-2b-it, and Gemma-2-9b-it) using a purpose-built 9-task hierarchy that controls task difficulty via the number of discrete operations required to verify correctness. The hierarchy spans six factual tasks (F0–F5, ranging from single-fact lookup to double counting over 6-city lists) and three arithmetic tasks (A1–A3, with one to three binary operations over integers in [1,99]). The central method is a layer-by-layer cross-task AUROC evaluation of bias-free logistic linear probes trained on mean-centered residual-stream activations at the final token, complemented by cosine-similarity analysis of probe directions across layer pairs and prompt conditions.

The load-bearing finding is a three-way fragmentation of truth-direction universality. First, the layer at which truth becomes linearly separable is task-dependent: simple factual tasks achieve near-perfect probe accuracy by layer 8, while arithmetic tasks only converge in late layers, and no single layer is universally optimal. Second, truth directions break down quantitatively with task difficulty: an F3-trained probe reaches only AUROC ≈ 0.6 on F4 (the 5-city counting variant), and the degradation onset is pinpointed to lists of length 3, one element beyond what can be resolved by pairwise comparison heuristics. For arithmetic, an A2-trained probe achieves approximately 0.65 AUROC on A3, confirming that each additional operator requiring stored intermediate results degrades generalization. Third, prompt framing is a major confound: switching from a passive no-prompt condition to an explicit ask-correct template ("Is the following correct? … Answer:") shifts truth-direction geometry so thoroughly that no-prompt probes fail to transfer to ask-correct activations, yet ask-correct enables near-perfect cross-family generalization from arithmetic probes to simple factual tasks F0–F2 (AUROC ≈ 1.00 at layer 25 in the generalization heatmap).

The paper argues that truth directions are fundamentally limited to settings where correctness can be established through factual retrieval, and that the computational demands of multi-step reasoning, rather than domain diversity or ambiguity per se, explain the weaker generalization previously reported on MMLU and TriviaQA by Bao et al. (2025). A parallel question, which the work does not resolve, is whether input-truth and output-truth representations are related in ways that could inform more reliable truth-detection pipelines. An alternative method the paper could have used is nonlinear or multi-class probing (as proposed by Savcisens & Eliassi-Rad 2025), which might have captured residual structure in the entangled high-dimensional activations of harder tasks, potentially revealing whether truth information is present but linearly inaccessible rather than absent altogether.

The most contestable aspect is the operationalization of task difficulty as operation count. Counting over a 3-city list is treated as categorically harder than a 2-city conjunction, but the models tested may have acquired list-counting competence unevenly depending on pre-training data distribution; the difficulty boundary at list length 3 could reflect a model-specific capability threshold rather than a principled geometric property of truth representations. A critical reader would want to see whether the AUROC collapse at list length 3 persists after controlling for the model's behavioral accuracy on these tasks: if the model itself fails to answer correctly on F4 inputs, the probe's failure may reflect absence of the relevant internal computation rather than a limit of linear separability per se.
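The cross-task AUROC evaluation and the cosine-similarity analysis described in this brief can be illustrated with a small sketch. It rests on loudly stated assumptions: the three tasks (`T1`, `T2`, `T3`) are synthetic and hypothetical, the probe is a difference-of-class-means direction swapped in for the paper's logistic probe to keep the example short, and in-domain entries are scored on the same data the direction was fit on. `T1` and `T2` share a truth axis; `T3` uses an orthogonal one.

```python
import numpy as np


def auroc(scores, labels):
    """Rank-based AUROC; ties count as half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def mean_diff_direction(X, y):
    """Stand-in probe: difference of class means (not the paper's logistic probe)."""
    return X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def make_task(rng, axis, n=400, d=16, signal=2.0):
    """Synthetic task whose true/false classes differ along one coordinate axis."""
    y = rng.permutation(np.tile([0, 1], n // 2))
    X = rng.normal(size=(n, d))
    X[:, axis] += signal * (2 * y - 1)
    return X - X.mean(axis=0), y                 # mean-centered activations


rng = np.random.default_rng(0)
tasks = {
    "T1": make_task(rng, axis=0),
    "T2": make_task(rng, axis=0),                # same truth axis as T1
    "T3": make_task(rng, axis=1),                # orthogonal truth axis
}
probes = {name: mean_diff_direction(X, y) for name, (X, y) in tasks.items()}
names = list(tasks)

# transfer[i, j]: AUROC of the probe fit on task i when scoring task j.
transfer = np.array([
    [auroc(tasks[dst][0] @ probes[src], tasks[dst][1]) for dst in names]
    for src in names
])
# similarity[i, j]: cosine similarity between the probe directions of tasks i and j.
similarity = np.array([
    [cosine(probes[a], probes[b]) for b in names] for a in names
])
```

In this toy setup the `T1` probe transfers well to `T2` and sits near chance on `T3`, and the cosine similarity between the `T1` and `T3` directions is close to zero. That is the qualitative pattern the brief describes for no-prompt versus ask-correct probes, here produced by construction rather than observed in a model.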

Original abstract

Large language models (LLMs) have been shown to encode truth of statements in their activation space along a linear truth direction. Previous studies have argued that these directions are universal in certain aspects, while more recent work has questioned this conclusion drawing on limited generalization across some settings. In this work, we identify a number of limits of truth-direction universality that have not been previously understood. We first show that truth directions are highly layer-dependent, and that a full understanding of universality requires probing at many layers in the model. We then show that truth directions depend heavily on task type, emerging in earlier layers for factual and later layers for reasoning tasks; they also vary in performance across levels of task complexity. Finally, we show that model instructions dramatically affect truth directions; simple correctness evaluation instructions significantly affect the generalization ability of truth probes. Our findings indicate that universality claims for truth directions are more limited than previously known, with significant differences observable for various model layers, task difficulties, task types, and prompt templates.
