Angelos Poulis (thinker:angelos-poulis)
Authored papers (1)
Linear truth directions in LLMs are reliable primarily for simple factual retrieval and break down as soon as truth assessment requires tracking intermediate results, a finding that sharply constrains the universality claims made by Marks & Tegmark (2024) and Bao et al. (2025). The paper probes Llama-3.1-8B-Instruct (32 layers, d_model=4096) and three additional models from the Llama and Gemma families across a controlled 9-task hierarchy of six factual tasks (F0–F5) and three arithmetic tasks (A1–A3). Factual truth directions emerge in early to mid layers, arithmetic truth directions keep emerging through the late layers, and no single layer is universally optimal. The paper introduces a layer-by-layer cross-task generalization evaluation using AUROC. F0-trained probes achieve near-perfect in-domain performance from layer 8 onward, yet at those same layers they show inverted separation (AUROC ≈ 0) on the negated variants (F1); at layer 7 the polarity-dependent direction t_p dominates (~0.38 variance explained vs. ~0.09 for the polarity-invariant direction t_G). Generalization collapses to near chance as soon as counting is required over lists of length 3, while an F3-trained probe reaches only AUROC ≈ 0.6 on F4; similarly, A1-trained probes degrade significantly on A2, and A2-trained probes achieve only ~0.65 on A3. Switching from a passive no-prompt template to an explicit ask-correct template shifts truth-direction geometry so much that no-prompt probes fail to transfer to ask-correct activations, yet in the ask-correct setting arithmetic-trained probes generalize almost perfectly to the simple factual tasks F0–F2. The paper argues that universality claims for truth directions are fundamentally bounded by the computational demand of truth assessment, and that conclusions drawn from single-layer, no-instruction, factual-only analyses should not be assumed to extend to settings involving multi-step reasoning or varied prompt formats.
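The layer-by-layer cross-task evaluation described above can be illustrated with a minimal sketch. It is not the paper's code: the difference-of-means probe is an assumption (the summary does not say which linear probe is used), the activations are synthetic Gaussian data rather than model hidden states, and every function name (`auroc`, `fit_mean_diff_probe`, `cross_task_auroc`, `make_synthetic_task`) is hypothetical. The synthetic "negated" task simply flips the sign of the truth direction, which is one simple way to reproduce the kind of inverted separation (AUROC near 0) reported for F0-trained probes on F1.

```python
import numpy as np


def auroc(scores, labels):
    """AUROC as the probability that a true example outscores a false one.

    Ties count as half, matching the usual Mann-Whitney definition.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def fit_mean_diff_probe(acts, labels):
    """Assumed probe: unit vector from the false-class mean to the true-class mean."""
    labels = np.asarray(labels, dtype=bool)
    direction = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return direction / np.linalg.norm(direction)


def cross_task_auroc(train_acts, train_labels, test_acts, test_labels):
    """Fit a probe per layer on the training task, score it on the test task.

    Activation arrays have shape (n_layers, n_examples, d_model).
    """
    results = []
    for layer in range(train_acts.shape[0]):
        w = fit_mean_diff_probe(train_acts[layer], train_labels)
        results.append(auroc(test_acts[layer] @ w, test_labels))
    return np.array(results)


def make_synthetic_task(rng, direction, strengths, n_examples, polarity):
    """Synthetic stand-in for hidden states: unit Gaussian noise plus a
    label-dependent shift along `direction`, scaled per layer by `strengths`.
    polarity=-1 mimics a task where the truth direction is flipped.
    """
    labels = rng.integers(0, 2, size=n_examples).astype(bool)
    signs = np.where(labels, 1.0, -1.0)
    acts = rng.normal(size=(len(strengths), n_examples, direction.shape[0]))
    for layer, s in enumerate(strengths):
        acts[layer] += polarity * s * signs[:, None] * direction[None, :]
    return acts, labels


rng = np.random.default_rng(0)
d_model = 16                                # toy width, not 4096
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
strengths = np.linspace(0.0, 3.0, 8)        # signal grows with depth (assumption)

train_acts, train_labels = make_synthetic_task(rng, direction, strengths, 200, +1)
held_acts, held_labels = make_synthetic_task(rng, direction, strengths, 200, +1)
neg_acts, neg_labels = make_synthetic_task(rng, direction, strengths, 200, -1)

in_domain = cross_task_auroc(train_acts, train_labels, held_acts, held_labels)
negated = cross_task_auroc(train_acts, train_labels, neg_acts, neg_labels)

for layer, (a, b) in enumerate(zip(in_domain, negated)):
    print(f"layer {layer}: in-domain AUROC={a:.2f}  negated AUROC={b:.2f}")
```

Reporting one AUROC per layer and per (train task, test task) pair, rather than a single number at a chosen layer, is what makes a claim such as "no single layer is universally optimal" checkable. With real models, the synthetic arrays would be replaced by hidden states extracted at a fixed token position for each statement.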
Affiliations (1)
Co-authors (2)
- Evimaria Terzi (9 shared)
- Mark Crovella (9 shared)
Recent mentions (1)
- papers-typed poulis-2026-testing-limits.md