finding

active

finding:negative-control-precise-analytical-assistant-suppresses-scores-haiku-0-64-gpt-5-4-1-06

Negative control ('precise analytical assistant') suppresses scores: Haiku -0.64, GPT-5.4 -1.06

Confirms the specificity of the contemplative prompt: analytical framing increases task focus at the expense of self-observation

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Claims (1)

claim

The active ingredient of the contemplative prompt is its full three-part structure: pause instruction + attention direction + purpose reframing working together.
supports
Mechanistic interpretation supported by control experiments showing partial prompts fail

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one. They are candidates for new edges or may be unrecognized duplicates.

Negative steering control achieves liar score of 0.95 in Experiment 2 Appendix example, representing near-complete fabrication · finding · 0.754
Extreme end of deception induction demonstrating near-complete fabrication of false narratives
Negative Control: Dense Off-Task Anchors · method · 0.738
E3 robustness test: dense but off-task anchors yield high ρd AND high dr, confirming mismatch dominates S
Meta-analysis demonstrates negative affect, physical pain, and cognitive control activate an overlapping region of the anterior midcingulate cortex functioning as a domain-general evaluative hub · finding · 0.718
Meta-analytic convergence supporting inseparability of evaluative and affective processing in ACC
AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al. 2024) · concept · 0.718
Related work studying capability of LLMs to subvert safety measures if severely misaligned
Self-referential processing yields significantly higher self-awareness scores than conceptual control on paradoxical reasoning: t(399)=14.90, p=3.0×10⁻⁴⁰ · finding · 0.717
Experiment 4 result ruling out semantic priming as explanation for the experimental effect
GPT is corrigible in a negative sense because the agent specification (prompt) is not fixed by the policy and the policy lacks direct training incentives to control its prompt. · claim · 0.709
GPT's corrigibility explained.
Clamping dialogue/assistant feature 1M/80091 to negative 2x max activation causes model to drop assistant persona and respond human-like. · finding · 0.706
Feature manipulation alters persona.
Deception feature steering under history, conceptual, and zero-shot controls produces 0% experience reports under both suppression and amplification · finding · 0.706
Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime
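
A "related by similarity" list like the one above can be produced by a simple filter: keep entities whose embedding has cosine similarity of at least 0.65 with this one and that share no typed edge with it. The sketch below is illustrative only. The function names, the dictionary-of-vectors layout, and the edge representation are assumptions made for the example, not the schema this graph actually uses.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs above the threshold with no typed edge to target.

    embeddings  : dict mapping entity id -> list of floats (hypothetical layout)
    typed_edges : iterable of (source_id, target_id) pairs (hypothetical layout)
    """
    target_vec = embeddings[target_id]
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}

    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, score))
    # Highest similarity first, matching the descending order of the list above.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would still need manual review, since a high cosine score alone does not distinguish a genuinely related finding from a duplicate.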