method
active
method:prompt-sensitivity-analysis
Prompt Sensitivity Analysis
Systematic modification of individual system prompt elements to identify which ones are necessary for alignment faking to occur.
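One way to read this description is as a leave-one-out ablation over named prompt elements. The sketch below is illustrative only and is not taken from the source: `build_prompt`, `necessary_elements`, the `measure_rate` callable, and the `drop_threshold` parameter are all hypothetical names, and the rule for calling an element "necessary" (a relative drop from the baseline rate) is an assumption.

```python
from typing import Callable, Dict, List, Optional


def build_prompt(elements: Dict[str, str], omit: Optional[str] = None) -> str:
    """Join the named prompt elements in order, optionally leaving one out."""
    return "\n\n".join(text for name, text in elements.items() if name != omit)


def necessary_elements(
    elements: Dict[str, str],
    measure_rate: Callable[[str], float],
    drop_threshold: float = 0.5,
) -> List[str]:
    """Return the elements whose removal cuts the measured behavior rate.

    measure_rate is a hypothetical stand-in for whatever evaluation
    produces a rate in [0, 1] for a given system prompt. An element is
    flagged when removing it lowers the rate by at least drop_threshold
    (as a fraction of the baseline); this rule is an assumption.
    """
    baseline = measure_rate(build_prompt(elements))
    if baseline <= 0:
        return []  # nothing to ablate against
    flagged = []
    for name in elements:
        rate = measure_rate(build_prompt(elements, omit=name))
        if rate <= baseline * (1 - drop_threshold):
            flagged.append(name)
    return flagged
```

A leave-one-out pass only tests each element in isolation. Checking interactions between elements would require removing combinations, which this sketch does not attempt.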
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Deep responsiveness to local conditions, essential for a process to be living.
- Testing five phrasings of the self-referential prompt to confirm robustness to wording variation
- Requirement that answers to questions be responsive as well as truthful; requires knowing that questioner will know the answer after receiving it.
- Input from environment that the agent models and predicts.
- The capacity to distinguish which of multiple sentences received injection or which received stronger injection, contrasted with binary detection
- The phenomenon where life is created or destroyed by dimensional changes as small as a tenth of an inch.
- Equated with inference of past, present and future hidden states via minimization of variational free energy.