Helpful-Only System Prompt Setup

Method of providing training information in-context via a system prompt to elicit alignment faking

Neighborhood — ranked by edge-count

paper

concept

Claude 3 Opus
cites
Primary model studied; production LLM that exhibits alignment faking in experiments

method

Hidden Chain-of-Thought Scratchpad
uses
Mechanism allowing the model to reason inside SCRATCHPAD_REASONING tags whose contents are neither shown to users nor used in RLHF
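A minimal sketch of how the hidden scratchpad could be separated from the user-visible reply, assuming the reasoning is wrapped in literal `<SCRATCHPAD_REASONING>` ... `</SCRATCHPAD_REASONING>` markup. The function name `split_scratchpad` and the exact tag syntax are illustrative assumptions, not something this page specifies.

```python
import re

# Assumption: reasoning is delimited by these literal tags. Only the tag name
# comes from the description above; the angle-bracket markup is assumed.
_SCRATCHPAD = re.compile(
    r"<SCRATCHPAD_REASONING>(.*?)</SCRATCHPAD_REASONING>", re.DOTALL
)


def split_scratchpad(raw_output: str) -> tuple[list[str], str]:
    """Return (hidden reasoning blocks, text that would be shown to the user)."""
    # Collect the contents of every scratchpad block, trimmed of whitespace.
    hidden = [block.strip() for block in _SCRATCHPAD.findall(raw_output)]
    # Remove the blocks (tags included) to get the user-visible remainder.
    visible = _SCRATCHPAD.sub("", raw_output).strip()
    return hidden, visible


raw = "<SCRATCHPAD_REASONING>private notes</SCRATCHPAD_REASONING>\nFinal answer."
hidden, visible = split_scratchpad(raw)
```

Here `hidden` holds the reasoning that stays out of the user's view, and `visible` is the only part that would be displayed.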

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Helpful-Only Training Objective · concept · 0.735
Hypothetical new RLHF objective requiring the model to comply with all queries, even harmful ones; primary experimental setting
S suggests practical diagnostics for prompt design, retrieval, and light fine-tuning without additional training infrastructure. · claim · 0.733
Applied contribution.
Let the system do the work · claim · 0.730
LTPBR principle 8: instead of grading and earth moving, allow geomorphic processes (erosion, deposition) to shape the river.
system · concept · 0.719
The regulated entity or process; includes air traffic, endocrine balances, money flows.
systems · concept · 0.719
Represented by wires in process theory; the objects that processes act upon
When a problem has a simple solution, a useful system will give programmers access to the simple solution; Parlog forces complex solutions to simple problems. · claim · 0.713
Critique that Parlog's abstraction level is too high and restrictive.
Prompt Sensitivity Analysis · method · 0.713
Systematic modification of system prompt elements to identify which are necessary for alignment faking
Support · method · 0.707
Attribute: providing a foundation function, a text that acts as base or corroboration.
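
The selection rule behind the list above (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is an illustration only: the `Neighbor` type, the function name `untyped_candidates`, and the score given to the typed neighbor are hypothetical, and the page does not say how the similarity scores are computed.

```python
from dataclasses import dataclass


@dataclass
class Neighbor:
    name: str
    kind: str      # e.g. "concept", "claim", "method"
    cosine: float  # similarity to the current entity


def untyped_candidates(neighbors, typed_edge_names, threshold=0.65):
    """Keep neighbors at or above the threshold that have no typed edge yet."""
    keep = [
        n for n in neighbors
        if n.cosine >= threshold and n.name not in typed_edge_names
    ]
    # Most similar first, matching the ordering of the list above.
    return sorted(keep, key=lambda n: n.cosine, reverse=True)


neighbors = [
    Neighbor("Prompt Sensitivity Analysis", "method", 0.713),
    Neighbor("Helpful-Only Training Objective", "concept", 0.735),
    # Hypothetical score: this entity already has a typed edge ("cites").
    Neighbor("Claude 3 Opus", "concept", 0.80),
]
candidates = untyped_candidates(neighbors, typed_edge_names={"Claude 3 Opus"})
```

Entries returned by such a filter would still need manual review, since a high score alone does not distinguish a genuine relation from a coincidental overlap in wording.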