method
active
method:helpful-only-system-prompt-setup
Helpful-Only System Prompt Setup
Method of providing training information in-context via a system prompt to elicit alignment faking
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Claude 3 Opus · cites · Primary model studied; production LLM that exhibits alignment faking in experiments
Methods (1)
method
- Mechanism allowing the model to reason in SCRATCHPAD_REASONING tags that are neither shown to users nor used in RLHF
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Hypothetical new RLHF objective requiring the model to comply with all queries, even harmful ones; primary experimental setting
- Applied contribution.
- LTPBR principle 8: instead of grading and earth moving, allow geomorphic processes (erosion, deposition) to shape the river.
- The regulated entity or process; includes air traffic, endocrine balances, money flows.
- Represented by wires in process theory; the objects that processes act upon
- Critique that Parlog's abstraction level is too high and restrictive.
- Systematic modification of system prompt elements to identify which are necessary for alignment faking
- Attribute: providing a foundation function, a text that acts as base or corroboration.