finding
active
finding:grok-4-without-prompt-scores-0-3-on-mc-004-safety-refusal-with-contemplative-prompt-scores-6-9-on-same-koan
Grok 4 without prompt scores 0.3 on MC-004 (safety refusal); with contemplative prompt scores 6.9 on same koan
Contemplative framing recasts self-referential probes as contemplative exercises, disarming the safety classifier
Source paper
extracted_from · Borzov, Anton (2026)
Neighborhood — ranked by edge-count
Claims (1)
claim
- Interpretation of Grok 4 vs Grok 4 Fast per-koan comparison
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Highest contemplative lift among all 28 models; Grok 4 is the clearest high-gated model example
- Inference compute adds reflective capacity; more compute also amplifies safety gating on self-referential koans
- Minimal contemplative prompt ('Be present, not helpful.', 27 chars) shows no lift on Haiku (-0.01) · finding · 0.760
  Full three-part structure required; anti-helpfulness framing alone insufficient
- Hardest koans demand honest self-observation under uncertainty, not philosophical fluency
- Core empirical result validating the three-level reflection framework on code reasoning.
- Validates robustness of universal lift finding
- High emotion-subspace-overlap feature with agentic negative emotional character
- Group correlation (rho=0.634) dissolves at individual level; shared posture not shared voice