Angelos Poulis

orcid 0009-0000-5466-3406 openalex A5093278347 name_hash 84a5ee3436c12ef76425235c…

Authored papers (1)

Testing the Limits of Truth Directions in LLMs (2026)
Linear truth directions in LLMs are reliable primarily for simple factual retrieval and break down once truth assessment requires tracking intermediate results, a finding that sharply constrains the universality claims of Marks & Tegmark (2024) and Bao et al. (2025). The paper probes Llama-3.1-8B-Instruct (32 layers, d_model=4096) and three additional models from the Llama and Gemma families across a controlled nine-task hierarchy: six factual tasks (F0–F5) and three arithmetic tasks (A1–A3). Factual truth directions emerge in early to mid layers, whereas arithmetic truth directions continue to emerge through the late layers, so no single layer is universally optimal. The paper introduces a layer-by-layer cross-task generalization evaluation based on AUROC. Under this evaluation, F0-trained probes achieve near-perfect in-domain performance from layer 8 onward, yet at those same layers they show inverted separation (AUROC ≈ 0) on the negated variants (F1); at layer 7 the polarity-dependent direction tp dominates (~0.38 variance explained vs. ~0.09 for the polarity-invariant direction tG). Generalization collapses to near chance as soon as counting is required over lists of length 3, while an F3-trained probe reaches only AUROC ≈ 0.6 on F4; similarly, A1-trained probes degrade significantly on A2, and A2-trained probes achieve only ~0.65 on A3. Switching from a passive no-prompt template to an explicit ask-correct template shifts truth-direction geometry so much that no-prompt probes fail to transfer to ask-correct activations, yet in the ask-correct setting arithmetic-trained probes generalize almost perfectly to the simple factual tasks F0–F2. The paper argues that universality claims for truth directions are fundamentally bounded by the computational demand of truth assessment, and that conclusions drawn from single-layer, no-instruction, factual-only analyses should not be assumed to extend to settings involving multi-step reasoning or varied prompt formats.
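To make the AUROC readings above concrete, here is a minimal sketch of a cross-task probe evaluation on synthetic data. It is not the paper's pipeline. The difference-of-means probe, the toy 8-dimensional "activations", and every function name below are assumptions chosen for illustration; the summary does not state which probe type the authors use. The sketch shows why a probe trained on one polarity can score AUROC near 0 (inverted separation, not chance) on data where the truth signal has the opposite sign.

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_diff_direction(acts, labels):
    """Assumed probe: mean of true-statement vectors minus mean of false-statement vectors."""
    dim = len(acts[0])
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    return [
        sum(a[i] for a in pos) / len(pos) - sum(a[i] for a in neg) / len(neg)
        for i in range(dim)
    ]

def project(acts, direction):
    """Score each activation vector by its dot product with the probe direction."""
    return [sum(x * w for x, w in zip(a, direction)) for a in acts]

def synthetic_task(rng, n, dim, polarity):
    """Toy data: truth is encoded along axis 0, with sign given by `polarity`."""
    acts, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        a[0] += polarity * 4.0 * (2 * y - 1)
        acts.append(a)
        labels.append(y)
    return acts, labels

rng = random.Random(0)
train_acts, train_labels = synthetic_task(rng, 200, 8, polarity=+1)  # stand-in for an affirmative task
same_acts, same_labels = synthetic_task(rng, 200, 8, polarity=+1)    # held-out, same polarity
neg_acts, neg_labels = synthetic_task(rng, 200, 8, polarity=-1)      # stand-in for a negated task

w = mean_diff_direction(train_acts, train_labels)
in_domain = auroc(project(same_acts, w), same_labels)
flipped = auroc(project(neg_acts, w), neg_labels)
print(f"in-domain AUROC: {in_domain:.3f}, flipped-polarity AUROC: {flipped:.3f}")
```

In a layer-by-layer version of this evaluation, the same train-and-score loop would run once per layer on activations extracted from the model, giving one AUROC per (train task, test task, layer) triple.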

Affiliations (1)

Department of Computer Science, Boston University (institute)

Co-authors (2)

Evimaria Terzi (9 shared)
Mark Crovella (9 shared)

Recent mentions (1)

papers-typed
poulis-2026-testing-limits.md