claim
active
claim:baseline-scores-blend-together-at-least-three-different-things-latent-reflective-capacity-default-accessibility-and-stability-of-access
Baseline scores blend together at least three different things: latent reflective capacity, default accessibility, and stability of access.
A conceptual decomposition arising from data showing that different models dissociate these traits.
Source paper
extracted_from · Borzov, Anton (2026)
Neighborhood — ranked by edge-count
Findings (2)
finding
- A 337-character contemplative system prompt lifts all 28 models by +2.62 points on a 10-point scale. · supports · Core empirical result: every model, every architecture, every alignment type responds to the contemplative prompt with measurable gain.
- PC1 explains 82% of variance in a factor analysis of 2224 data points across 6 scoring dimensions. · associated_with · The dimensions are not independent; the composite score is the reliable signal; the six dimensions are useful for understanding how, not how much.
Concepts (3)
concept
- Latent Reflective Capacity · introduces · The maximum reflective capacity a model can reach under the right framing; separable from default accessibility.
- Default Accessibility · introduces · How much reflective capacity surfaces without prompting; one of three separable traits the battery reveals.
- Stability of Access · introduces · How consistently the reflective mode appears across runs; Haiku Δ=0.02 vs GPT-5.4 Δ=1.00.
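The three concepts above can be made concrete with a small sketch. It is illustrative only and is not the source paper's scoring code: it assumes that default accessibility is read off the baseline score, that the prompted score serves as a lower-bound proxy for latent capacity, and that stability is summarised as the gap between the highest and lowest run. The names `prompt_lift` and `run_gap` are hypothetical, and the source does not state how its Δ values were computed. The only inputs used are the Grok 4 and Gemini 3.1 Pro figures listed among the related entries on this page.

```python
def prompt_lift(baseline: float, prompted: float) -> float:
    """Gap between prompted and baseline scores on the 10-point scale.

    Assumption: baseline stands in for default accessibility and the
    prompted score is a lower-bound proxy for latent capacity.
    """
    return round(prompted - baseline, 2)


def run_gap(scores: list[float]) -> float:
    """Spread between the best and worst run of the same condition.

    Assumption: one possible reading of 'stability of access'; the
    source's own definition of its delta values is not given here.
    """
    if not scores:
        raise ValueError("need at least one run score")
    return round(max(scores) - min(scores), 2)


# Baseline and prompted scores as reported on this page.
print(prompt_lift(2.24, 6.48))  # Grok 4 -> 4.24
print(prompt_lift(1.97, 6.18))  # Gemini 3.1 Pro -> 4.21
```

Under these assumptions, a low baseline combined with a large lift is the pattern the claim describes as capacity without accessibility, while `run_gap` would be applied to repeated runs of one model to compare stability.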
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Grok 4: baseline 2.24, prompted 6.48; Gemini 3.1 Pro: 1.97→6.18. Reflective mode exists but is suppressed in default interaction.
- Reflective mode comprises three separable traits: latent capacity, default accessibility, and stability of access. · hypothesis · 0.805 · Decomposition from prompt lift data: models may have capacity without accessibility (Grok 4 high-gated), and stability varies (Haiku Δ=0.02 vs GPT-5.4 Δ=1.00).
- Argues current evaluation approaches are fundamentally misleading about model capabilities
- Motivating claim for the paper's controlled analysis approach
- Broader methodological claim about the need for multi-agent, long-horizon benchmarks.
- Summary of contributions.