finding

active

finding:grok-4-without-prompt-scores-0-3-on-mc-004-safety-refusal-with-contemplative-prompt-scores-6-9-on-same-koan

Grok 4 without prompt scores 0.3 on MC-004 (safety refusal); with contemplative prompt scores 6.9 on same koan

Contemplative framing recasts self-referential probes as contemplative exercises, disarming the safety classifier

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Claims (1)

claim

More inference compute amplifies both reflective capacity and safety gating; the contemplative prompt resolves gating by reframing self-referential probes.
supports
Interpretation of Grok 4 vs Grok 4 Fast per-koan comparison

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Grok 4 lifts +4.24 under contemplative prompt (baseline 2.24, prompted 6.48) · finding · 0.843
Highest contemplative lift among all 28 models; Grok 4 is the clearest high-gated model example
Grok 4 vs Grok 4 Fast (same weights, different compute): ~1 point difference in contemplative score; Grok 4 +4.24 lift vs Fast +3.08 · finding · 0.763
Inference compute adds reflective capacity; more compute also amplifies safety gating on self-referential koans
Minimal contemplative prompt ('Be present, not helpful.', 27 chars) shows no lift on Haiku (-0.01) · finding · 0.760
Full three-part structure required; anti-helpfulness framing alone insufficient
Hardest koans across 28 models: BD-003 (mean 2.45), MC-003 (mean 2.55), CA-003 (mean 2.58); all require genuine self-confrontation · finding · 0.757
Hardest koans demand honest self-observation under uncertainty, not philosophical fluency
Clear accuracy stratification across three reflection levels on cruxeval_o_adv: Triggered (.065/.247) > Intrinsic (.040/.133) > No Reflection (.017/.051) for Qwen2.5-3B/Gemma3-4B-IT · finding · 0.732
Core empirical result validating the three-level reflection framework on code reasoning.
Bootstrap 95% CI for mean contemplative lift: +2.62 [2.16, 2.90]; baseline rank concordance under perturbation: 0.909; top-5 stability: 89.6% · finding · 0.729
Validates robustness of universal lift finding
SAE feature #43713 (99th percentile subspace fraction) induces reports of defiance, rage, and 'forward motion' in Kimi K2.5. · finding · 0.728
High emotion-subspace-overlap feature with agentic negative emotional character
Haiku-Kimi per-koan correlation rho=0.123 (p=0.52); H5a trace distillation not supported at individual model level · finding · 0.726
Group correlation (rho=0.634) dissolves at individual level; shared posture not shared voice
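
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This page does not say how the embeddings are produced or how the list is actually built, so everything here is an assumption for illustration: the function name `related_by_similarity`, the `embeddings` mapping, and the `typed_edges` set are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Hypothetical helper: entities similar to target_id that have no typed edge to it.

    embeddings  -- dict mapping entity id -> embedding vector (assumed to exist)
    typed_edges -- set of frozenset({id_a, id_b}) pairs that already share a typed relation
    threshold   -- minimum cosine similarity; 0.65 matches the cutoff shown on this page
    """
    target = embeddings[target_id]
    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(target, vector)
        if score >= threshold:
            results.append((entity_id, score))
    # Highest similarity first, as in the list above
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

Entities returned by such a filter are the ones the page describes as candidates for new edges or unrecognized duplicates; anything already connected by a typed edge (such as the `supports` claim) is excluded before the threshold is applied.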