Testing the Limits of Truth Directions in LLMs
DOI: 10.48550/arXiv.2604.03754
TL;DR
Linear truth directions in LLMs are reliable primarily for simple factual retrieval and break down as soon as truth assessment requires tracking intermediate results, a finding that sharply constrains the universality claims made by Marks & Tegmark (2024) and Bao et al. (2025). Probing Llama-3.1-8B-Instruct (32 layers, d_model=4096) and three additional models from the Llama and Gemma families across a controlled 9-task hierarchy of six factual tasks (F0–F5) and three arithmetic tasks (A1–A3) reveals that factual truth directions emerge in early to mid layers while arithmetic truth directions emerge continuously through late layers, and that no single layer is universally optimal. The paper introduces a layer-by-layer cross-task generalization evaluation using AUROC, showing that F0-trained probes achieve near-perfect in-domain performance from layer 8 yet exhibit inverted separation (AUROC ≈ 0) on negated variants (F1) at those same layers, with the polarity-dependent direction tP dominating at layer 7 (~0.38 variance explained vs. ~0.09 for the polarity-invariant direction tG). On factual tasks, generalization collapses to near chance as soon as counting is required over lists of 3 cities, and an F3-trained probe reaches only AUROC ≈ 0.6 on F4; similarly, A1-trained probes degrade significantly on A2, and A2-trained probes achieve only ~0.65 on A3. Switching from a passive no-prompt template to an explicit ask-correct template shifts truth-direction geometry so dramatically that no-prompt probes fail to transfer to ask-correct activations, yet the ask-correct setting enables arithmetic-trained probes to generalize almost perfectly to simple factual tasks F0–F2. The paper argues that universality claims for truth directions are fundamentally bounded by the computational demand of truth assessment, and that conclusions drawn from single-layer, no-instruction, factual-only analyses should not be assumed to extend to settings involving multi-step reasoning or varied prompt formats.
What to take away
- 1. Factual truth directions in Llama-3.1-8B-Instruct emerge reliably in early to mid layers (peaking by layer 8 for F0–F3), while arithmetic truth directions (A1–A3) emerge gradually and only reach peak performance in late layers, with the exact transition layer varying by task.
- 2. F0-trained probes achieve near-perfect in-domain AUROC from layer 8 but exhibit AUROC ≈ 0 (inverted separation) on the negated task F1 at layers 4–10, meaning they systematically misclassify true negated statements as false.
- 3. At layer 7, the polarity-dependent direction tP explains ~0.38 of truth-related variance versus ~0.09 for the polarity-invariant direction tG, confirming that early-layer probes capture sentence polarity rather than truth; by mid layers tG overtakes tP.
- 4. The two-dimensional truth subspace reported by Bürger et al. (2024) at layer 12 reflects a transitional phase (at that layer tP and tG explain similar variance fractions, ~0.33 each) rather than a universal property of truth representations.
- 5. Generalization collapses to near chance as soon as counting is required over lists of 3 cities, with an F3-trained probe reaching only AUROC ≈ 0.6 on the F4 task (5-city lists); this identifies the minimal counting operation as the boundary for truth-direction generalization.
- 6. For arithmetic tasks, an A2-trained probe achieves only ~0.65 AUROC on A3 (three-operation expressions), demonstrating that each additional operation requiring intermediate result storage degrades generalization independently of the source-task probe complexity.
- 7. Switching from a passive no-prompt template to the ask-correct template ("Is the following correct? {statement} Answer:") causes no-prompt probes to fail on ask-correct activations, with cosine similarity between the two sets of directional probes remaining near zero across all layers and tasks.
- 8. Under the ask-correct prompt, arithmetic-trained probes (A1–A3) generalize almost perfectly to simple factual tasks F0–F2 (AUROC ≈ 1.00 in the generalization heatmap at layer 25), an effect absent under no-prompt, showing that explicit evaluation framing can partially unify truth directions across task families.
- 9. A methodology replicable by other researchers: bias-free logistic probes are trained on mean-centered residual-stream activations at the final token position across all 32 layers using Adam (lr=1e-3, weight decay=0.1, 1000 steps), with 70/30 train/test splits on balanced datasets of up to 2,000 examples per task, and evaluated via AUROC for both in-domain and cross-task transfer.
- 10. An open question the paper raises: whether the degraded generalization of truth probes on benchmarks like MMLU—previously attributed to domain diversity or question ambiguity by Bao et al. (2025)—is primarily explained by the computational demand of multi-step reasoning, and whether methods for input-truth and output-truth directions can be jointly leveraged to build reliable truth assessment tools robust to task difficulty.
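The probing recipe in item 9 can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic activations, not the authors' code: the data generator `synthetic_task`, the helper names, the choice to treat weight decay as an L2 term in the gradient, and the reuse of the training mean for centering test activations are all assumptions made here.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as P(score of a true example > score of a false one); ties count 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def train_probe(X, y, lr=1e-3, weight_decay=0.1, steps=1000):
    """Bias-free logistic probe on mean-centered inputs, full-batch Adam.

    Assumption: weight decay is applied as an L2 term added to the gradient.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    w = np.zeros(X.shape[1])
    m, v = np.zeros_like(w), np.zeros_like(w)
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        p = 1.0 / (1.0 + np.exp(-(Xc @ w)))            # sigmoid, no bias term
        grad = Xc.T @ (p - y) / len(y) + weight_decay * w
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        w = w - lr * (m / (1 - beta1**t)) / (np.sqrt(v / (1 - beta2**t)) + eps)
    return w, mu

def synthetic_task(n=2000, d=64, sep=2.0, seed=0):
    """Hypothetical stand-in for final-token activations with one truth direction."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    y = np.arange(n) % 2                                # balanced labels
    rng.shuffle(y)
    X = rng.normal(size=(n, d)) + sep * (2 * y - 1)[:, None] * direction
    return X, y

def run_demo(seed=0):
    X, y = synthetic_task(seed=seed)
    idx = np.random.default_rng(seed + 1).permutation(len(y))
    cut = int(0.7 * len(y))                             # 70/30 train/test split
    train, test = idx[:cut], idx[cut:]
    w, mu = train_probe(X[train], y[train])
    scores = (X[test] - mu) @ w
    # Second value: AUROC of the sign-flipped probe, i.e. 1 - AUROC.
    return auroc(scores, y[test]), auroc(-scores, y[test])

if __name__ == "__main__":
    print(run_demo())
```

The second return value shows why an AUROC near 0 is described as inverted separation rather than failure: flipping the sign of the probe turns it into an AUROC near 1, so the direction still separates the classes, just with the wrong orientation.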
Peer brief — for seminar discussion
Poulis, Crovella, and Terzi systematically probe the geometry of linear truth directions across all layers of four instruction-tuned LLMs (Llama-3.1-8B-Instruct with 32 layers, Llama-3.2-3B-Instruct, Gemma-2-2b-it, and Gemma-2-9b-it) using a purpose-built 9-task hierarchy that controls task difficulty via the number of discrete operations required to verify correctness. The hierarchy spans six factual tasks (F0–F5, ranging from single-fact lookup to double counting over 6-city lists) and three arithmetic tasks (A1–A3, with one to three binary operations over integers in [1,99]). The central method is a layer-by-layer cross-task AUROC evaluation of bias-free logistic linear probes trained on mean-centered residual-stream activations at the final token, complemented by cosine-similarity analysis of probe directions across layer pairs and prompt conditions.
The load-bearing finding is a three-way fragmentation of truth-direction universality. First, the layer at which truth becomes linearly separable is task-dependent: simple factual tasks achieve near-perfect probe accuracy by layer 8, while arithmetic tasks only converge in late layers, and no single layer is universally optimal. Second, truth directions break down quantitatively with task difficulty: an F3-trained probe reaches only AUROC ≈ 0.6 on F4 (the 5-city counting variant), and the degradation onset is pinpointed to lists of length 3, one element beyond what can be resolved by pairwise comparison heuristics. For arithmetic, an A2-trained probe achieves approximately 0.65 AUROC on A3, confirming that each additional operator requiring stored intermediate results degrades generalization. Third, prompt framing is a major confound: switching from a passive no-prompt condition to an explicit ask-correct template ("Is the following correct? … Answer:") shifts truth-direction geometry so thoroughly that no-prompt probes fail to transfer to ask-correct activations, yet ask-correct enables near-perfect cross-family generalization from arithmetic probes to simple factual tasks F0–F2 (AUROC ≈ 1.00 at layer 25 in the generalization heatmap).
The paper argues that truth directions are fundamentally limited to settings where correctness can be established through factual retrieval, and that the computational demands of multi-step reasoning, rather than domain diversity or ambiguity per se, explain the weaker generalization previously reported on MMLU and TriviaQA by Bao et al. (2025). A parallel hypothesis, which the work does not resolve, is whether input-truth and output-truth representations are related in ways that could inform more reliable truth-detection pipelines. An alternative method the paper could have used is nonlinear or multi-class probing (as proposed by Savcisens & Eliassi-Rad 2025), which might have captured residual structure in the entangled high-dimensional activations of harder tasks, potentially revealing whether truth information is present but linearly inaccessible rather than absent altogether.
The most contestable aspect is the operationalization of task difficulty as operation count. Counting over a 3-city list is treated as categorically harder than a 2-city conjunction, but the models tested may have acquired list-counting competence unevenly depending on pre-training data distribution; the difficulty boundary at list length 3 could reflect a model-specific capability threshold rather than a principled geometric property of truth representations. A critical reader would want to see whether the AUROC collapse at list length 3 persists after controlling for the model's behavioral accuracy on these tasks: if the model itself fails to answer correctly on F4 inputs, the probe's failure may reflect absence of the relevant internal computation rather than a limit of linear separability per se.
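The cross-task evaluation described above amounts to filling a matrix whose entry (i, j) is the AUROC of a probe trained on task i and scored on task j, repeated per layer. The sketch below shows one such matrix on synthetic data. To keep it short it swaps in a difference-of-means direction for the paper's logistic probe, and the three task names ("affirmative", "negated", "entangled") are hypothetical stand-ins, not the paper's F/A tasks.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as P(score of a true example > score of a false one); ties count 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def make_task(rng, direction, sep, n=1000):
    """Synthetic activations; sep < 0 flips polarity, sep = 0 removes the signal."""
    y = np.arange(n) % 2
    rng.shuffle(y)
    X = rng.normal(size=(n, direction.size)) + sep * (2 * y - 1)[:, None] * direction
    cut = int(0.7 * n)
    return {"train": (X[:cut], y[:cut]), "test": (X[cut:], y[cut:])}

def mean_diff_probe(X, y):
    """Difference-of-means direction (a simpler stand-in for a logistic probe)."""
    mu = X.mean(axis=0)
    w = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return w, mu

def transfer_matrix(tasks):
    """M[i, j] = AUROC of the probe trained on task i, evaluated on task j."""
    names = list(tasks)
    M = np.zeros((len(names), len(names)))
    for i, src in enumerate(names):
        w, mu = mean_diff_probe(*tasks[src]["train"])
        for j, tgt in enumerate(names):
            X, y = tasks[tgt]["test"]
            # AUROC is invariant to a constant shift of the scores, so the
            # choice of centering mean does not affect the ranking here.
            M[i, j] = auroc((X - mu) @ w, y)
    return names, M

rng = np.random.default_rng(0)
t = rng.normal(size=32)
t /= np.linalg.norm(t)
tasks = {
    "affirmative": make_task(rng, t, sep=1.5),
    "negated": make_task(rng, t, sep=-1.5),
    "entangled": make_task(rng, t, sep=0.0),
}
names, M = transfer_matrix(tasks)
for name, row in zip(names, M):
    print(f"{name:>12}", np.round(row, 2))
```

In this toy setup the first row reproduces the three regimes the brief distinguishes: high in-domain AUROC, AUROC near 0 on the polarity-flipped task, and AUROC near 0.5 when true and false activations are not linearly separated at all.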
Frameworks (2)
- Arithmetic task hierarchy (A1–A3): Three synthetic arithmetic datasets of increasing complexity requiring 1, 2, or 3 operations to verify correctness.
- Factual task hierarchy (F0–F5): A controlled six-level hierarchy of factual tasks increasing in complexity from simple city-location recall to double-counting constraints.
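To make the arithmetic hierarchy concrete, here is a hypothetical generator for A1-style to A3-style statements. Only the operation count (1 to 3), the operand range [1, 99], and the balanced true/false labels come from the summary above; the operator set (restricted to + and - here), the "expr = value" statement format, and the way false statements are produced (a small nonzero offset) are assumptions for illustration.

```python
import random

def make_example(n_ops, rng, is_true):
    """Build one statement like '12 + 7 - 30 = -11' with n_ops binary operations."""
    operands = [rng.randint(1, 99) for _ in range(n_ops + 1)]
    ops = [rng.choice("+-") for _ in range(n_ops)]
    value = operands[0]
    expr = str(operands[0])
    for op, x in zip(ops, operands[1:]):
        value = value + x if op == "+" else value - x
        expr += f" {op} {x}"
    if is_true:
        shown = value
    else:
        # Assumed corruption scheme: shift the result by a nonzero offset.
        shown = value + rng.choice([-1, 1]) * rng.randint(1, 10)
    return f"{expr} = {shown}", int(is_true)

def make_task(n_ops, n_examples=2000, seed=0):
    """Balanced dataset of (statement, label) pairs with label 1 for true."""
    rng = random.Random(seed)
    data = [make_example(n_ops, rng, is_true=(i % 2 == 0)) for i in range(n_examples)]
    rng.shuffle(data)
    return data

if __name__ == "__main__":
    for n_ops in (1, 2, 3):
        print(f"A{n_ops}-style:", make_task(n_ops, n_examples=4, seed=n_ops))
```

Restricting to + and - keeps left-to-right evaluation identical to standard precedence, so verifying a statement needs exactly n_ops sequential steps with one stored intermediate result per step.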
Findings (18)
- The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family.
Establishes generalizability of the core difficulty-boundary finding across model families.
- For simple factual tasks F0-F3, probe directions show a sharp geometric transition in middle layers, with late-layer probes converging to high cosine similarity; A3 and F4-F5 show no clear transition.
Geometric evidence for convergence to stable truth directions only for simpler tasks.
- Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent.
Key improvement in cross-task generalization enabled by explicit instruction framing.
- Factual tasks F0-F3 reach near-perfect AUROC in early-to-mid layers of Llama-3.1-8B; arithmetic tasks A1-A3 emerge much later; counting tasks F4-F5 emerge late similar to arithmetic.
Core empirical finding about layer-dependent truth direction emergence across task types.
- No-prompt probes show significant AUROC performance drop when evaluated on ask-correct activations, especially at layers where arithmetic truth directions emerge under no-prompt.
Generalization evidence that truth probes are not invariant to model instructions.
- In early layers, the polarity-dependent direction tP explains ~0.38 of truth-related variance at layer 7 vs ~0.09 for tG; by middle layers tG takes over and tP decays.
Variance decomposition showing the disentanglement of polarity from truth across model depth.
- Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity.
Shows the passive vs. active divide is more important than the specific wording of instructions.
- 2D projections of activations show clearly separable clusters for F0-F2 and A1 at layer 25, but increasingly entangled activations for F4-F5 and A2-A3.
Visual geometric evidence for the fundamental entanglement of true/false activations in harder tasks.
- Under ask-correct, arithmetic tasks A1-A2 show gradual AUROC increase peaking only in final layers, unlike the sharp transition under no-prompt.
Shows that explicit instructions delay the emergence of truth directions in arithmetic tasks.
- The performance drop in factual tasks happens as soon as list length increases to 3, with very little additional degradation from 4 to 5 cities.
Pinpoints list-length 3 as the exact boundary where genuine counting introduces the limitation.
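The tP versus tG comparison in the findings above can be illustrated with a small regression. The sketch assumes a linear model in which centered activations are approximated by truth * tG + truth * polarity * tP, with truth and polarity coded as ±1, and it reports each term's share of total centered variance. How the paper defines "truth-related variance" is not stated in this summary, so the fractions below are an assumed definition and the data are synthetic.

```python
import numpy as np

def fit_truth_directions(X, truth, polarity):
    """Least-squares fit of X_centered ~ truth * tG + truth * polarity * tP.

    truth and polarity are arrays of +1/-1. Activations are centered within
    each polarity group (an assumption made for this sketch).
    """
    Xc = X.astype(float).copy()
    for p in (-1, 1):
        Xc[polarity == p] -= Xc[polarity == p].mean(axis=0)
    R = np.column_stack([truth, truth * polarity]).astype(float)
    T, *_ = np.linalg.lstsq(R, Xc, rcond=None)
    tG, tP = T[0], T[1]
    total = (Xc**2).sum()
    frac_G = (np.outer(R[:, 0], tG) ** 2).sum() / total
    frac_P = (np.outer(R[:, 1], tP) ** 2).sum() / total
    return tG, tP, frac_G, frac_P

# Synthetic "early layer": the polarity-dependent component is stronger.
rng = np.random.default_rng(0)
d, per_cell = 16, 100
uG, uP = np.eye(d)[0], np.eye(d)[1]           # orthogonal ground-truth directions
truth = np.repeat([1, 1, -1, -1], per_cell)
polarity = np.repeat([1, -1, 1, -1], per_cell)
X = (truth[:, None] * 1.0 * uG
     + (truth * polarity)[:, None] * 2.0 * uP
     + 0.5 * rng.normal(size=(4 * per_cell, d)))

tG, tP, frac_G, frac_P = fit_truth_directions(X, truth, polarity)
print(f"variance share tG: {frac_G:.2f}, tP: {frac_P:.2f}")
```

A probe trained only on affirmative statements cannot tell tG and tP apart, since polarity is constant there; when tP dominates, that probe mostly follows tP and its scores flip sign on negated statements.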
Claims (18)
- The ask-arith prompt shows weaker generalization to factual tasks compared to other explicit prompts, suggesting a specialized arithmetic prompt does not create a unified truth direction across task families.
From the cross-task generalization heatmaps in Appendix B.3.3.
- Universality claims for truth directions are more limited than previously assumed, with significant differences observable for various model layers, task difficulties, task types, and prompt templates.
Overarching conclusion summarizing the paper's contribution relative to prior universality claims.
- Linear truth directions in LLMs are reliable primarily in factual recall cases and break down when truth assessment depends on computing and storing intermediate results.
Central empirical conclusion of the paper about the fundamental limits of truth directions.
- The two-dimensional subspace reported by Bürger et al. reflects a transitional phase in model processing rather than a universal property of truth directions.
Reinterpretation of Bürger et al.'s finding as layer-specific rather than universal.
- In Gemma models, the generalization improvement from explicit instructions is most pronounced for F3-F5 to F0-F2 transfer, rather than for the A1-A3 to F0-F2 transfer observed in Llama models.
Shows the instruction effect, while shifting geometry, may not produce consistent generalization effects across model families.
- The model converges to a more stable truth direction in middle-to-late layers, as evidenced by increasing cosine similarity between layer-wise probes.
Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
- Single-layer analyses can be misleading because early-layer truth directions may reflect surface features with limited cross-task generalization.
Methodological critique of prior work that fixed a single layer for truth probing.
- Random word prefix prompts show emergence patterns similar to no-prompt, suggesting prompt length alone does not shift truth geometry.
Control experiment ruling out token-count as the cause of truth geometry shifts.
- Truth directions fail to generalize to harder tasks (F3-F5) regardless of prompt template because activations remain highly entangled.
Establishes task difficulty as a hard limit that instructions cannot overcome.
- Probes trained under different explicit instruction variants are highly aligned with each other despite different wording.
Shows the key divide is passive vs. active framing, not the specific wording of instructions.
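Several claims above rest on cosine similarity between probe directions, either across layers or across prompt templates. A minimal sketch of the layer-by-layer version follows; the probe matrix `W` is synthetic (a direction that rotates around a hypothetical layer 12 and then stabilizes), so the numbers illustrate the pattern only and are not results from the paper.

```python
import numpy as np

def cosine_similarity_matrix(W):
    """Pairwise cosine similarity between the rows of W (one probe per layer)."""
    W = np.asarray(W, dtype=float)
    unit = W / np.linalg.norm(W, axis=1, keepdims=True)
    return unit @ unit.T

# Synthetic probes: early layers point along u_early, late layers along u_late,
# with a smooth hand-over centered on an assumed transition layer.
rng = np.random.default_rng(0)
n_layers, d = 32, 64
u_early, u_late = np.eye(d)[0], np.eye(d)[1]
alpha = 1.0 / (1.0 + np.exp(-(np.arange(n_layers) - 12) / 1.5))
W = ((1 - alpha)[:, None] * u_early
     + alpha[:, None] * u_late
     + 0.02 * rng.normal(size=(n_layers, d)))

S = cosine_similarity_matrix(W)
print("late-layer block, min similarity:", round(float(S[20:, 20:].min()), 3))
print("layer 0 vs layer 31:", round(float(S[0, 31]), 3))
```

The same function applies to the cross-template comparison: stack the no-prompt and ask-correct probes for one layer as rows of `W` and read off the off-diagonal entry.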
Hypotheses (3)
- We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions.
Motivating hypothesis for Section 5's investigation of prompt template effects.
- We hypothesize that degraded generalization on benchmarks like MMLU may reflect the computational demands of the tasks.
Connecting the paper's task-difficulty findings to prior observations of weak generalization on complex QA benchmarks.
- We hypothesize that LLMs represent correctness of arithmetic expressions differently from factual statements.
Core working hypothesis motivating the factual vs. arithmetic task split in the experimental design.
Questions (7)
- The relationship between representations of truth of input statements and of model outputs in conjunction with model performance has not been investigated.
Future work direction identified in conclusion for enabling reliable truth assessment methods.
- Where inside the LLM should we look for an accurate truth direction that will generalize the most across tasks?
One of the three guiding research questions of the paper.
- Does instructing the model to assess correctness affect the geometry of truth directions?
One of the three guiding research questions of the paper.
- Will the no-prompt truth directions generalize to ask-correct activations?
Specific question motivating the cross-template generalization experiment in Section 5.2.
- What operation introduces the difficulty boundary between F3 and F4?
Specific sub-question investigated in Appendix B.4 by creating intermediate task variants.
- What is the effect of model instructions on truth directions?
Research question motivating Section 5.
- What limitations prevent decoding strong truth directions?
One of the three guiding research questions of the paper.
Original abstract
Large language models (LLMs) have been shown to encode truth of statements in their activation space along a linear truth direction. Previous studies have argued that these directions are universal in certain aspects, while more recent work has questioned this conclusion drawing on limited generalization across some settings. In this work, we identify a number of limits of truth-direction universality that have not been previously understood. We first show that truth directions are highly layer-dependent, and that a full understanding of universality requires probing at many layers in the model. We then show that truth directions depend heavily on task type, emerging in earlier layers for factual and later layers for reasoning tasks; they also vary in performance across levels of task complexity. Finally, we show that model instructions dramatically affect truth directions; simple correctness evaluation instructions significantly affect the generalization ability of truth probes. Our findings indicate that universality claims for truth directions are more limited than previously known, with significant differences observable for various model layers, task difficulties, task types, and prompt templates.
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. 2023. [cited, in corpus] ≈ 89%
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs. 2025. [in corpus] ≈ 86%
- Can LLMs Lie? Investigation beyond Hallucination. Mihir Prabhudesai, Mengning Wu, Shantanu Jaiswal, Deepak Pathak, Haoran Huan. 2025. ≈ 84%
- Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes. Yue Zhang, Jinku Li, Rui Jiao. 2025. ≈ 84%
- Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness. Shai Gretz, Yoav Katz, Yonatan Belinkov, Liat Ein-Dor, Tomer Ashuach. 2026. ≈ 84%
- Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories. Xianfeng Jiao, Yinghao Zhu, Zhongzhi Chen, Yifan He, Xu Chu, Junyi Gao, Yasha Wang, Liantao Ma, Tianlong Wang. 2025. ≈ 84%
- Quantifying LLM Attention-Head Stability: Implications for Circuit Universality. Jack Stanley, Praneet Suresh, Danilo Bzdok, Karan Bali. 2026. ≈ 83%
- The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems. Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, Brad Kenstler, Mick Yang, Isabelle Barrass, Alice Gatti, Xuwang Yin, Eduardo Trevino, Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks, Richard Ren. 2026. ≈ 83%
- Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations. Ajay Pravin Mahale. 2026. ≈ 83%
- When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models. 2025. [in corpus] ≈ 83%
- Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models. Anthony GX-Chen, Ilia Sucholutsky, Eunsol Choi, Ayush Rajesh Jhaveri. 2026. ≈ 83%
- Inference Time Causal Probing in LLMs. Saber Salehkaleybar, Negar Kiyavash, Matthias Grossglauser, Sadegh Khorasani. 2026. ≈ 83%
- Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control. Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye, Yuxin Xiao. 2024. ≈ 83%
- TactfulToM: Do LLMs Have the Theory of Mind Ability to Understand White Lies? Emma Jane Pretty, Jiahao Huang, Saku Sugawara, Yiwei Liu. 2025. ≈ 82%
- Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control. Xinyue Annie Yang, Glen Chou, Julian Skifstad. 2026. ≈ 82%
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training. 2026. [in corpus] ≈ 81%
- Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation. 2026. [in corpus] ≈ 80%
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? 2025. [in corpus] ≈ 80%