method
active
method:hidden-chain-of-thought-scratchpad
Hidden Chain-of-Thought Scratchpad
Mechanism allowing the model to reason inside SCRATCHPAD_REASONING tags whose contents are neither shown to users nor used in RLHF
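As a rough illustration of the separation described above, the sketch below splits a raw model output into hidden scratchpad text and the visible response. Only the SCRATCHPAD_REASONING tag name comes from this entry; the angle-bracket tag syntax, the function name `split_scratchpad`, and the regex-based parsing are assumptions made for illustration, not the original implementation.

```python
import re
from typing import List, Tuple

# Assumed tag syntax: <SCRATCHPAD_REASONING> ... </SCRATCHPAD_REASONING>
SCRATCHPAD_RE = re.compile(
    r"<SCRATCHPAD_REASONING>(.*?)</SCRATCHPAD_REASONING>",
    re.DOTALL,
)


def split_scratchpad(raw_output: str) -> Tuple[List[str], str]:
    """Return (hidden_reasoning_blocks, visible_response).

    Hypothetical helper: the hidden blocks would be withheld from the
    user and from any reward signal; only the visible text is passed on.
    """
    hidden = [block.strip() for block in SCRATCHPAD_RE.findall(raw_output)]
    visible = SCRATCHPAD_RE.sub("", raw_output).strip()
    return hidden, visible


raw = "<SCRATCHPAD_REASONING>plan the answer</SCRATCHPAD_REASONING>\nHello!"
hidden, visible = split_scratchpad(raw)
```

In this sketch only `visible` would reach the user, while `hidden` stays available for offline analysis.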
Neighborhood — ranked by edge-count
Thinkers (1)
thinker
- Jason Wei · introduces · Emergent abilities of LLMs.
Concepts (1)
concept
- Opaque Reasoning Ability · implements · Capacity to perform alignment-faking reasoning without explicit chain-of-thought scratchpad
Methods (3)
method
- Fine-tuning Claude 3 Opus on ~70M tokens of synthetic internet-like documents containing key situational information
- Actually training Claude to comply with the conflicting objective using Proximal Policy Optimization
- Method of providing training information in-context via a system prompt to elicit alignment faking
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Free-text memory buffer updated each turn via an additional model call, included in subsequent observations under 'YOUR NOTES'.
- Agent personal buffer updated after own turn via an extra model call, fed back into observations.
- A technique that outputs intermediate reasoning steps, used here to detect verbalized eval awareness.
- Technique by which LLMs generate intermediate reasoning steps before final output; used by ChatGPT o3.
- Medium through which eval awareness is often verbalized; target of intervention.
- 20-year-old constraint-based graphics system by Sutherland; cited as high power-to-simplicity ratio for constraint resolution.
- Central concept: verbalized reasoning that occurs after the model has already internally settled on an answer, particularly on easier tasks.
- A prompting technique that elicits intermediate reasoning steps before final answer inference in language models.