method
active
method:prompt-invariance-test

Prompt Invariance Test

Testing five phrasings of the self-referential prompt to confirm robustness to wording variation

Neighborhood — ranked by edge-count

Methods (1)

method
  • Five variants of the experimental prompt tested to confirm the effect is robust to changes in specific wording
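
One way such a robustness check could be scored is sketched below. This is a minimal illustration under stated assumptions, not the procedure used in the source: the variant labels, the per-trial boolean outcomes, and the `max_spread` tolerance are all hypothetical.

```python
def effect_rates(outcomes_by_variant):
    """Map each prompt variant to the fraction of its trials that showed the effect."""
    return {
        name: sum(1 for hit in trials if hit) / len(trials)
        for name, trials in outcomes_by_variant.items()
    }


def is_invariant(outcomes_by_variant, max_spread=0.15):
    """Return (passed, spread), where spread is the highest minus the lowest
    effect rate across variants. max_spread is an assumed tolerance."""
    rates = effect_rates(outcomes_by_variant)
    spread = max(rates.values()) - min(rates.values())
    return spread <= max_spread, spread


# Placeholder outcomes for five phrasings (hypothetical, not real results).
example = {
    "variant_1": [True, True, False, True],
    "variant_2": [True, True, True, False],
    "variant_3": [True, False, True, True],
    "variant_4": [True, True, True, True],
    "variant_5": [False, True, True, True],
}

passed, spread = is_invariant(example)
print(passed, spread)
```

A spread-based criterion is only one option; with more trials per variant, a statistical test of homogeneity across variants would be a reasonable alternative.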

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Systematic modification of system prompt elements to identify which are necessary for alignment faking
  • Causal Invariance · concept · 0.753
    Property that causal mechanisms remain stable across environments; desirable for out-of-distribution (OOD) generalization.
  • Property where a rule learned on a fixed-size grid generalizes to larger grids, observed in checkerboard and lizard experiments
  • 5x5 Pearson correlation matrix of OCEAN traits computed from MDS injection sweeps to assess cross-trait leakage
  • Baseline comparison method where models are directly prompted to be honest rather than fine-tuned
  • strawberry test · concept · 0.724
    Eliezer Yudkowsky's benchmark for LLM awareness, mentioned as a test that collapsed-awareness models might fail.
  • Open-ended situational judgment tests synthesized using GPT-5.1 from ATOMIC10x heads and inventory items; primary evaluation instrument for open-ended steering
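
The selection rule for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is an illustrative sketch only: the embedding vectors, the `typed_edges` set of unordered pairs, and the function names are assumptions, not the system's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """List (name, score) for entities whose cosine similarity to target is at
    least threshold and that share no typed edge with it, highest score first."""
    hits = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already linked by a typed edge.
        if name == target or frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```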