Does reflective depth scale linearly with inference compute budget?

Grok 4 and Grok 4 Fast differ by about 1 point, a gap attributed to inference compute; whether the effect scales linearly with compute is unresolved.
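A minimal sketch of why a single pairwise comparison cannot settle the linearity question: two (compute, score) points are fit exactly by both a linear and a logarithmic curve, and the two only separate at a third budget. The compute units and score values below are hypothetical placeholders chosen to give a 1-point gap; they are not measurements from the paper.

```python
import math

def fit_linear(p1, p2):
    """Return s(c) = a + b*c passing exactly through two (compute, score) points."""
    (c1, s1), (c2, s2) = p1, p2
    b = (s2 - s1) / (c2 - c1)
    a = s1 - b * c1
    return lambda c: a + b * c

def fit_log(p1, p2):
    """Return s(c) = a + b*ln(c) passing exactly through the same two points."""
    (c1, s1), (c2, s2) = p1, p2
    b = (s2 - s1) / (math.log(c2) - math.log(c1))
    a = s1 - b * math.log(c1)
    return lambda c: a + b * math.log(c)

# HYPOTHETICAL values: arbitrary compute units, placeholder scores 1 point apart.
low_compute = (1.0, 5.0)
high_compute = (4.0, 6.0)

linear = fit_linear(low_compute, high_compute)
logarithmic = fit_log(low_compute, high_compute)

# Both curves agree on the two observed points but diverge at a third budget.
for budget in (1.0, 4.0, 16.0):
    print(budget, round(linear(budget), 2), round(logarithmic(budget), 2))
```

Under these assumed numbers the two fits agree at the observed budgets and disagree at the unobserved one, so at least a third compute setting on the same weights would be needed to tell the shapes apart.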

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Papers (1)

paper

Koan Battery: Measuring Reflective Mode Accessibility in AI
associated_with

Hypotheses (1)

hypothesis

H12: Inference compute adds to reflective capacity — higher compute budget produces higher reflective scores on the same weights.
gates
Exploratory hypothesis, supported by the ~1-point difference between Grok 4 and Grok 4 Fast.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
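The selection rule described above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is an illustration only: the function names, the dictionary layout, and the toy vectors are assumptions, not the site's actual implementation or embedding model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_by_similarity(query_vec, candidates, typed_edges, threshold=0.65):
    """List candidates at or above the threshold that lack a typed edge, best first."""
    hits = []
    for name, vec in candidates.items():
        if name in typed_edges:
            continue  # already linked by a typed relation, so not listed here
        score = cosine_similarity(query_vec, vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Toy 2-D embeddings (hypothetical); real embeddings would be much higher-dimensional.
query = [1.0, 0.0]
candidates = {
    "a": [1.0, 0.0],   # identical direction
    "b": [0.0, 1.0],   # orthogonal, below threshold
    "c": [1.0, 1.0],   # 45 degrees apart
    "d": [1.0, 0.0],   # similar, but already has a typed edge
}
related = related_by_similarity(query, candidates, typed_edges={"d"})
print(related)
```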

Reflective reasoning requires late-stage integration of semantic and reasoning signals, hence reflection-related directions emerge more clearly in higher network layers. (claim · 0.766)
Interpretive claim about the locus of reflection in transformer architecture.
Accuracy does not vary linearly with latent reflection directions; instead it follows a non-linear mapping that requires deeper theoretical treatment. (claim · 0.756)
Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.
Default behavior hides reflective capacity; models exhibit high gating between latent capacity and accessibility. (finding · 0.752)
Grok 4: baseline 2.24, prompted 6.48; Gemini 3.1 Pro: baseline 1.97, prompted 6.18. Reflective mode exists but is suppressed in default interaction.
A linear reflection direction exists in reasoning LLMs' latent representation space that governs self-reflection behavior. (claim · 0.751)
Core claim of ReflCtrl that a single direction captures and controls reflection
More inference compute amplifies both reflective capacity and safety gating; the contemplative prompt resolves gating by reframing self-referential probes. (claim · 0.742)
Interpretation of Grok 4 vs Grok 4 Fast per-koan comparison
ReflectiveBench should be designed as cognitive glue using Lyons-Levin's 5-condition and 9-property spec. (claim · 0.732)
Artificial Analysis will host ReflectiveBench as a non-competing dimension within AA's leaderboard if pitched. (prediction · 0.731)
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity. (claim · 0.728)
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
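As a worked check on the gating finding listed above, the baseline and prompted scores quoted for Grok 4 and Gemini 3.1 Pro can be turned into a per-model gap. The "gating gap" name and the helper function are illustrative assumptions, not terminology or a metric confirmed by the paper; only the four score values come from the listing.

```python
# (baseline, prompted) reflective scores as quoted in the finding above.
scores = {
    "Grok 4": (2.24, 6.48),
    "Gemini 3.1 Pro": (1.97, 6.18),
}

def gating_gap(baseline, prompted):
    """Points gained when the reflective mode is prompted (hypothetical metric name)."""
    return round(prompted - baseline, 2)

gaps = {model: gating_gap(*pair) for model, pair in scores.items()}

for model, gap in gaps.items():
    print(f"{model}: +{gap} points when prompted")
```

Both gaps come out a little over 4 points, several times the ~1-point Grok 4 vs Grok 4 Fast difference that motivates the question on this page.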