finding

active

finding:the-performance-drop-in-factual-tasks-happens-as-soon-as-list-length-increases-to-3-with-very-little-additional-degradation-from-4-to-5-cities

The performance drop in factual tasks happens as soon as list length increases to 3, with very little additional degradation from 4 to 5 cities.

Pinpoints a list length of 3 as the boundary at which genuine counting begins to limit truth directions.
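One way to locate such a boundary is to compute probe AUROC separately for each list length and report the first length whose AUROC falls clearly below the shortest-list baseline. The sketch below is only illustrative and is not the paper's evaluation code: the function names, the `tolerance` threshold, and the `toy_scores` values are all hypothetical placeholders, not results from the paper.

```python
from typing import Dict, List, Optional, Tuple

# (probe scores, true/false labels) for the statements at one list length
ScoredSet = Tuple[List[float], List[int]]


def auroc(scores: List[float], labels: List[int]) -> float:
    """Rank-based AUROC: the chance that a randomly chosen true statement
    gets a higher probe score than a randomly chosen false one (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one true and one false example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def first_drop(per_length: Dict[int, ScoredSet],
               tolerance: float = 0.1) -> Optional[int]:
    """Return the smallest list length whose AUROC is more than `tolerance`
    below the AUROC at the shortest list length, or None if there is none."""
    lengths = sorted(per_length)
    baseline = auroc(*per_length[lengths[0]])
    for k in lengths[1:]:
        if baseline - auroc(*per_length[k]) > tolerance:
            return k
    return None


# Made-up placeholder scores (NOT data from the paper), keyed by list length.
toy_scores: Dict[int, ScoredSet] = {
    2: ([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]),  # probe separates classes
    3: ([0.6, 0.3, 0.5, 0.4], [1, 1, 0, 0]),  # probe near chance
}

print(first_drop(toy_scores))
```

The tolerance is a judgment call: a larger value only flags sharp drops such as the one described here, while a smaller value would also flag gradual degradation.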

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Claims (1)

claim

The need for genuine counting over lists of more than two elements introduces the key limitation of truth directions.
supports
Identified as the exact computational operation that breaks truth direction generalization.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We hypothesize that degraded generalization on benchmarks like MMLU may reflect the computational demands of the tasks. · hypothesis · 0.774
Connecting the paper's task-difficulty findings to prior observations of weak generalization on complex QA benchmarks.
Truth probes fail to generalize to harder factual tasks F3-F5 regardless of prompt template, with AUROC near or below 0.6. · finding · 0.756
Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
Public benchmarks (LMArena) decline as commercial versions (Arena Intelligence) grow; leaderboards face deflation curve. · claim · 0.748
Factual tasks F0-F3 reach near-perfect AUROC in early-to-mid layers of Llama-3.1-8B; arithmetic tasks A1-A3 emerge much later; counting tasks F4-F5 emerge late similar to arithmetic. · finding · 0.745
Core empirical finding about layer-dependent truth direction emergence across task types.
There are more bad next steps than good ones in a design process; typically perhaps 90-95 out of 100 possible next steps make the thing worse. · claim · 0.745
Quantitative intuition to justify radical skepticism toward early ideas.
In Opus 4.1, the think word representation decays to baseline in the final layer because the strong next-token prediction drowns out other representations. · hypothesis · 0.742
Explanation for the 'silent' thought phenomenon.
For a given task, the number of all sequences which work is tiny by comparison with the huge number of all possible sequences; less than a trillionth of all 6 × 10^23 possible sequences actually work well enough. · claim · 0.742
A combinatorial argument that good sequences are astronomically rare, emphasizing the difficulty of discovery.
In Opus 4.1, representation of the think word decays to baseline by the final layer, unlike Claude 3 models where it persists. · finding · 0.736
Suggests that later models can keep the thought 'silent' rather than letting it influence output.