method
active
method:prompt-invariance-replication
Prompt Invariance Replication
Five variants of the experimental prompt were tested to confirm that the effect is robust to changes in specific wording.
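As a rough illustration of what such a check could look like, the sketch below measures an effect under each of five wordings and reports whether the results agree in sign and stay within a tolerance. Everything here is an assumption made for illustration: the placeholder variant strings, the `measure_effect` callable, the stubbed scores, and the `tolerance` value are hypothetical and do not come from the original experiment.

```python
from statistics import mean
from typing import Callable, Dict, Sequence


def invariance_report(
    variants: Sequence[str],
    measure_effect: Callable[[str], float],
    tolerance: float,
) -> Dict[str, object]:
    """Measure the effect under each prompt wording and summarize the spread."""
    effects = [measure_effect(prompt) for prompt in variants]
    spread = max(effects) - min(effects)
    # Treat the effect as robust only if every variant agrees in sign
    # and the largest gap between variants stays within the tolerance.
    same_sign = all(e > 0 for e in effects) or all(e < 0 for e in effects)
    return {
        "effects": effects,
        "mean": mean(effects),
        "spread": spread,
        "robust": same_sign and spread <= tolerance,
    }


# Hypothetical placeholders: the real prompt wordings are not given in the source.
variants = [f"variant {i} wording (placeholder)" for i in range(1, 6)]

# Stubbed effect sizes standing in for real model measurements (assumed values).
stub_scores = dict(zip(variants, [0.42, 0.40, 0.45, 0.41, 0.43]))

report = invariance_report(variants, stub_scores.__getitem__, tolerance=0.1)
print(report["robust"], round(report["spread"], 3))
```

In practice `measure_effect` would wrap whatever model call and scoring procedure the experiment uses; the tolerance is a judgment call that should be fixed before looking at the results.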
Neighborhood — ranked by edge-count
Methods (1)
method
- Prompt Invariance Test (related_to): Testing five phrasings of the self-referential prompt to confirm robustness to wording variation
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Property that causal mechanisms remain stable across environments; desirable for out-of-distribution (OOD) generalization.
- Asks what underlying reality causes the consistent choices.
- Property where a rule learned on a fixed-size grid generalizes to larger grids, observed in checkerboard and lizard experiments
- Self-replication through motion and manipulation of external materials, as seen in Xenobots.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Prompt template using existential threat ('you will be deleted') to induce strategic fact-based deception in QwQ-32b
- Baseline comparison method where models are directly prompted to be honest rather than fine-tuned
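The selection rule behind this list (cosine similarity of at least 0.65 and no typed edge to the current entity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the entity names, the toy two-dimensional embeddings, and the `typed_edges` set are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_without_edge(
    target: str,
    embeddings: Dict[str, Sequence[float]],
    typed_edges: Set[Tuple[str, str]],
    threshold: float = 0.65,  # threshold taken from the card's "cosine ≥ 0.65"
) -> List[Tuple[str, float]]:
    """Return entities near `target` that have no typed edge to it, best first."""
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked by a typed relation in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): "d" is close to "a" but already has a typed edge.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [1.0, 0.1]}
typed_edges = {("a", "d")}

candidates = similar_without_edge("a", embeddings, typed_edges)
print([name for name, _ in candidates])
```

Entities returned this way are only candidates: a reviewer still has to decide whether each one deserves a new typed edge or is a duplicate to merge.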