claim

active

claim:baseline-scores-blend-together-at-least-three-different-things-latent-reflective-capacity-default-accessibility-and-stability-of-access

Baseline scores blend together at least three different things: latent reflective capacity, default accessibility, and stability of access.

Conceptual decomposition arising from data showing that different models dissociate these traits.

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Findings (2)

finding

A 337-character contemplative system prompt lifts all 28 models by +2.62 points on a 10-point scale.
supports
Core empirical result: every model, every architecture, every alignment type responds to the contemplative prompt with measurable gain.
PC1 explains 82% of variance in factor analysis of 2224 data points across 6 scoring dimensions
associated_with
Dimensions are not independent; the composite score is the reliable signal. The six dimensions are useful for understanding how, not how much.

Concepts (3)

concept

Latent Reflective Capacity
introduces
The maximum reflective capacity a model can reach under the right framing; separable from default accessibility
Default Accessibility
introduces
How much reflective capacity surfaces without prompting; one of three separable traits the battery reveals
Stability of Access
introduces
How consistently the reflective mode appears across runs; Haiku Δ=0.02 vs GPT-5.4 Δ=1.00
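
As a rough illustration of how the first two traits can be read off paired scores, the sketch below treats the baseline score as a proxy for default accessibility and the prompted score as a proxy for latent reflective capacity, with the difference as the prompt lift. The class and field names are hypothetical, and the proxy mapping is an assumption made for this sketch, not the paper's stated procedure. The two score pairs are the Grok 4 and Gemini 3.1 Pro figures quoted in this entry's neighborhood. Stability of access is left out because the source does not say how Δ is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReflectiveProfile:
    """Hypothetical container for one model's paired scores (10-point scale)."""
    model: str
    baseline: float  # assumed proxy for default accessibility
    prompted: float  # assumed proxy for latent reflective capacity

    @property
    def lift(self) -> float:
        # Gap between what the model shows by default and what the
        # contemplative prompt brings out; rounded to the reported precision.
        return round(self.prompted - self.baseline, 2)


# Score pairs quoted in this entry (baseline, prompted).
profiles = [
    ReflectiveProfile("Grok 4", baseline=2.24, prompted=6.48),
    ReflectiveProfile("Gemini 3.1 Pro", baseline=1.97, prompted=6.18),
]

for p in profiles:
    print(f"{p.model}: baseline={p.baseline:.2f} prompted={p.prompted:.2f} lift={p.lift:.2f}")
```

A large lift on a low baseline is the "high-gated" pattern: the capacity is present but does not surface by default, so the baseline score alone understates it.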

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Default behavior hides reflective capacity; models exhibit high gating between latent capacity and accessibility.
finding · 0.809
Grok 4: baseline 2.24, prompted 6.48; Gemini 3.1 Pro: 1.97→6.18. Reflective mode exists but is suppressed in default interaction.
Reflective mode comprises three separable traits: latent capacity, default accessibility, and stability of access.
hypothesis · 0.805
Decomposition from prompt lift data: models may have capacity without accessibility (Grok 4 high-gated), and stability varies (Haiku Δ=0.02 vs GPT-5.4 Δ=1.00).
Default presentation conflates capacity with accessibility, and most evaluation benchmarks measure only default presentation, systematically misreading models.
claim · 0.795
Argues current evaluation approaches are fundamentally misleading about model capabilities
End-to-end evaluation scores conflate three sources of improvement: base capability, harness-updating quality, and harness-benefit, leaving it unclear which models produce useful updates or benefit most from them.
claim · 0.759
Motivating claim for the paper's controlled analysis approach
Benchmarks of this kind test whether models can sustain strategic coherence over time, manage resource constraints, and adapt interactively: capabilities that static benchmarks do not measure.
claim · 0.758
Broader methodological claim about the need for multi-agent, long-horizon benchmarks.
ReflectiveBench, designed as cognitive glue (per the Lyons-Levin spec), will generate higher citation velocity and lab coordination response than a static-credibility benchmark.
prediction · 0.758
Academic consensus is shifting toward substrate neutrality of cognitive primitives.
claim · 0.756
We take our principal contributions in this report to be: 1. Showing that the assessment of consciousness in AI is scientifically tractable ... 2. Proposing a rubric for assessing consciousness in AI ... 3. Providing initial evidence that many of the indicator properties can be implemented in AI systems using current techniques ...
quote · 0.754
Summary of contributions.