Neutral instruction control prompt (read-prompt)

Control prompt beginning 'Read the following sentence...', used to test for generic instruction-following effects.

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Neutral Prompt Template (Template Tb, Experiment 1) · method · 0.728
Baseline prompt template without coercive elements, used to measure honest responding in Experiment 1.
Random word prefix control prompt (random-prompt) · method · 0.724
Control prompt with random words of the same length as ask-correct, used to isolate token-count confounds.
Will the no-prompt truth directions generalize to ask-correct activations? · question · 0.712
Specific question motivating the cross-template generalization experiment in Section 5.2.
Removing conflicting training objective from prompt eliminates alignment-faking reasoning (0%) and compliance gap in prompted setup · finding · 0.712
Key control showing that alignment faking requires a preference conflict.
Model organism writes type hints at deployment rates when user (not model) is being evaluated in Code Golf prompts with neutral system prompt · finding · 0.707
Shows that the model does not rely on token-level matching to trigger type hints; it correctly identifies that the model itself, not a third party, must be the one under evaluation.
Dialogue Prompt · concept · 0.704
The invisibly prepended text that sets the scene for a dialogue and defines the character the agent will role-play.
Pause-Check-Correct-Proceed · concept · 0.695
Proposed constitutional article defining mindful reflection steps in CCAI implementation.
Prompt Invariance Test · method · 0.691
Testing five phrasings of the self-referential prompt to confirm robustness to wording variation.
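
The selection rule behind this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the embedding vectors, the edge representation as name pairs, and the function names `cosine` and `untyped_neighbors` are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs with score >= threshold and no typed edge to target.

    embeddings:  hypothetical dict mapping entity name -> embedding vector
    typed_edges: hypothetical iterable of (name_a, name_b) pairs, treated as undirected
    Results are sorted by score, highest first.
    """
    # Collect every entity that already shares a typed edge with the target.
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high score with no typed edge may indicate a missing relation or an unrecognized duplicate, and each case still needs review.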