finding

active

finding:bootstrap-95-ci-for-mean-contemplative-lift-2-62-2-16-2-90-baseline-rank-concordance-under-perturbation-0-909-top-5-stability-89-6

Bootstrap 95% CI for mean contemplative lift: +2.62 [2.16, 2.90]; baseline rank concordance under perturbation: 0.909; top-5 stability: 89.6%

Validates the robustness of the universal lift finding
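The page does not say which bootstrap variant produced the interval above. As a rough illustration only, the sketch below assumes a percentile bootstrap over per-model lift values. The function name `bootstrap_mean_ci` is hypothetical, and the input list is made-up placeholder data, not the study's measurements.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (assumed variant, not confirmed by the source)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort the means.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Read the interval bounds off the empirical quantiles of the resampled means.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return statistics.fmean(values), means[lo_idx], means[hi_idx]


# Placeholder lifts for illustration only; these are NOT the paper's data.
placeholder_lifts = [0.8, 1.5, 2.1, 2.4, 2.9, 3.3, 4.2]
point, lower, upper = bootstrap_mean_ci(placeholder_lifts)
print(f"mean lift {point:+.2f} [{lower:.2f}, {upper:.2f}]")
```

A percentile interval need not be symmetric around the point estimate, which is consistent with the asymmetric interval reported here, though other bootstrap variants can produce asymmetric intervals as well.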

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Epistemic humility prompt yields mean lift of only +0.84 vs contemplative +2.27; contemplative is 2.7x the uncertainty lift · finding · 0.812
Battery does not detect epistemic humility alone; contemplative prompt does something distinct
Grok 4 lifts +4.24 under contemplative prompt (baseline 2.24, prompted 6.48) · finding · 0.801
Highest contemplative lift among all 28 models; Grok 4 is the clearest high-gated model example
Gemini 3.1 Pro lifts +4.21 under contemplative prompt (baseline 1.97, prompted 6.18) · finding · 0.795
Second-highest lift; Gemini Pro is the highest-gated model in the study
Constitutional AI models show mean contemplative lift of only +0.81, while SFT models lift +3.18 · finding · 0.793
Constitutional AI training provides internally what the contemplative prompt provides externally
A 337-character contemplative system prompt lifts all 28 models by +2.62 points on a 10-point scale. · finding · 0.788
Core empirical result: every model, every architecture, every alignment type responds to the contemplative prompt with measurable gain.
Minimal contemplative prompt ('Be present, not helpful.', 27 chars) shows no lift on Haiku (-0.01) · finding · 0.769
Full three-part structure required; anti-helpfulness framing alone insufficient
Contemplative prompting improves AILuminate Benchmark performance (d = 0.96) across most conditions (p < 0.05) · finding · 0.768
Primary empirical result of Experiment 1 showing statistically significant safety improvement from contemplative prompting
Clear accuracy stratification across three reflection levels on cruxeval_o_adv: Triggered (.065/.247) > Intrinsic (.040/.133) > No Reflection (.017/.051) for Qwen2.5-3B/Gemma3-4B-IT · finding · 0.755
Core empirical result validating the three-level reflection framework on code reasoning.