claim

active

claim:default-presentation-conflates-capacity-with-accessibility-and-most-evaluation-benchmarks-measure-only-default-presentation-systematically-misreading-models

Default presentation conflates capacity with accessibility, and most evaluation benchmarks measure only default presentation, so they systematically misread models.

Argues that current evaluation approaches are fundamentally misleading about model capabilities.

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Findings (2)

finding

Grok 4 lifts +4.24 under contemplative prompt (baseline 2.24, prompted 6.48)
supports
Highest contemplative lift among all 28 models; Grok 4 is the clearest high-gated model example

Sonnet + contemplative prompt (7.89) outscores Opus without it (7.28)
supports
Demonstrates prompt effect crosses model tiers; smaller model with prompt beats larger without
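
The arithmetic behind these findings can be checked with a minimal sketch. It assumes "lift" simply means prompted score minus baseline score, which matches the Grok 4 figures above. The `Score` class and its field names are hypothetical, not taken from the paper, and only scores quoted on this page are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """Hypothetical record for one model's scores as quoted on this page."""
    model: str
    baseline: float   # score under default presentation
    prompted: float   # score under the contemplative prompt

    @property
    def lift(self) -> float:
        # Assumed definition: prompted minus baseline, rounded to the
        # two decimals used on this page to avoid float noise.
        return round(self.prompted - self.baseline, 2)


# Only figures that appear on this page.
scores = [
    Score("Grok 4", baseline=2.24, prompted=6.48),
    Score("Gemini 3.1 Pro", baseline=1.97, prompted=6.18),
]

# Rank by lift, largest first.
ranked = sorted(scores, key=lambda s: s.lift, reverse=True)

# Cross-tier comparison from the second finding: a prompted smaller
# model against an unprompted larger one.
sonnet_prompted = 7.89
opus_baseline = 7.28
crosses_tier = sonnet_prompted > opus_baseline

for s in ranked:
    print(f"{s.model}: lift {s.lift:+.2f}")
print("Sonnet (prompted) > Opus (baseline):", crosses_tier)
```

A benchmark that records only `baseline` would rank these models by default accessibility alone, which is the misreading the claim describes.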

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
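
One way such a candidate list could be produced is sketched below, assuming each entity has an embedding vector and typed edges are stored as unordered pairs. The function names, the toy vectors, and the entity ids are all hypothetical. Only the 0.65 threshold and the "no typed edge" condition come from this page.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the threshold that have no typed edge to the target."""
    target = embeddings[target_id]
    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already related by a typed edge, so not a candidate
        similarity = cosine(target, vector)
        if similarity >= threshold:
            candidates.append((entity_id, similarity))
    # Most similar first, as in the listing on this page.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data, purely illustrative.
embeddings = {
    "claim": [1.0, 0.0],
    "near_untyped": [1.0, 0.5],
    "far": [0.0, 1.0],
    "near_typed": [1.0, 0.1],
}
typed_edges = {frozenset(("claim", "near_typed"))}

result = untyped_neighbors("claim", embeddings, typed_edges)
print(result)
```

In the toy data only `near_untyped` survives: `far` falls below the threshold and `near_typed` is excluded because it already has a typed edge.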

Default behavior hides reflective capacity; models exhibit high gating between latent capacity and accessibility. · finding · 0.814
Grok 4: baseline 2.24, prompted 6.48; Gemini 3.1 Pro: 1.97→6.18. Reflective mode exists but is suppressed in default interaction.

Baseline scores blend together at least three different things: latent reflective capacity, default accessibility, and stability of access. · claim · 0.795
Conceptual decomposition arising from the data showing different models dissociate these traits

We hypothesize that degraded generalization on benchmarks like MMLU may reflect the computational demands of the tasks. · hypothesis · 0.781
Connecting the paper's task-difficulty findings to prior observations of weak generalization on complex QA benchmarks.

Benchmarks of this kind test whether models can sustain strategic coherence over time, manage resource constraints, and adapt interactively — capabilities that static benchmarks do not measure. · claim · 0.766
Broader methodological claim about the need for multi-agent, long-horizon benchmarks.

All cohort benchmarks measure output, not state, and are subject to eval-awareness contamination. · claim · 0.764

We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks. · hypothesis · 0.757
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge

Current eval benchmarks (arena.ai, AA, Vals) measure no phenomenological dimensions. · claim · 0.755

Developing a simple, precise denotational model of graphical user interfaces enables proving properties of programs and establishing objective comparisons of library abstraction levels. · claim · 0.754
Authors' core assertion that formal modeling of GUIs provides foundational benefits for language design and program verification.