Prompt Invariance Test

Testing five phrasings of the self-referential prompt to confirm robustness to wording variation
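
One way to picture such a check is as a spread test over per-variant outcomes. The sketch below is illustrative only: the function name, the tolerance, and the example rates are hypothetical placeholders, not values or procedures taken from the paper. It assumes each prompt variant has already been scored as the fraction of trials in which the effect appeared.

```python
def is_invariant(effect_rates, tolerance=0.1):
    """Return True if the effect rate varies by at most `tolerance` across variants.

    effect_rates: dict mapping a variant label to the fraction of trials
                  (0.0 to 1.0) in which the effect was observed.
    tolerance:    hypothetical cutoff on the max-min spread; not from the source.
    """
    if not effect_rates:
        raise ValueError("need at least one prompt variant")
    rates = list(effect_rates.values())
    return max(rates) - min(rates) <= tolerance


# Hypothetical placeholder rates for five phrasings (not reported results).
example_rates = {
    "variant_1": 0.90,
    "variant_2": 0.93,
    "variant_3": 0.88,
    "variant_4": 0.95,
    "variant_5": 0.91,
}
robust = is_invariant(example_rates)
```

A max-min spread is the simplest possible criterion; a real replication would more likely compare variants with a statistical test over repeated trials.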

Neighborhood — ranked by edge-count

method

Prompt Invariance Replication
related_to
Five variants of the experimental prompt tested to confirm the effect is robust to changes in specific wording

artifact

Large Language Models Report Subjective Experience Under Self-Referential Processing
introduces
Key paper reporting that LLMs produce structured first-person descriptions claiming awareness or subjective experience during self-referential processing.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Prompt Sensitivity Analysis · method · 0.785
Systematic modification of system prompt elements to identify which are necessary for alignment faking.
Organizational Invariance · concept · 0.754
Causal Invariance · concept · 0.753
Property that causal mechanisms remain stable across environments; desirable for out-of-distribution (OOD) generalization.
Boundary-Size Invariance · concept · 0.740
Property where a rule learned on a fixed-size grid generalizes to larger grids; observed in checkerboard and lizard experiments.
OCEAN Trait Covariance Matrix M · method · 0.732
5×5 Pearson correlation matrix of OCEAN traits, computed from MDS injection sweeps to assess cross-trait leakage.
Honesty Prompt Baseline · method · 0.725
Baseline comparison method in which models are directly prompted to be honest rather than fine-tuned.
strawberry test · concept · 0.724
Eliezer Yudkowsky's benchmark for LLM awareness, mentioned as a test that collapsed-awareness models might fail.
Synthetic Situational Judgment Test Battery · method · 0.723
Open-ended situational judgment tests synthesized using GPT-5.1 from ATOMIC10x heads and inventory items; the primary evaluation instrument for open-ended steering.
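
The selection rule for this list (cosine similarity of at least 0.65 and no typed edge to this entity) can be sketched as below. This is a minimal illustration under stated assumptions: the embedding vectors, entity names, and the `untyped_neighbors` helper are hypothetical, and the actual system's embedding model and storage are not described in the source.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """List (name, score) pairs at or above `threshold` that lack a typed edge.

    embeddings:  dict of entity name -> vector (hypothetical toy data here).
    typed_edges: set of entity names already linked to `target` by a typed relation.
    """
    scored = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and already-linked entities
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, score))
    # Highest similarity first, matching the descending order of the list above.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-D vectors, for illustration only.
toy_embeddings = {"A": [1.0, 0.0], "B": [1.0, 1.0], "C": [0.0, 1.0], "D": [1.0, 0.1]}
candidates = untyped_neighbors("A", toy_embeddings, typed_edges={"D"})
```

In this toy setup, "D" is nearly identical to "A" but is excluded because it already has a typed edge, which is the case the list is designed to filter out.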