Hidden Chain-of-Thought Scratchpad

Mechanism that lets the model reason inside SCRATCHPAD_REASONING tags whose contents are neither shown to users nor used in RLHF
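
As a rough illustration of how such a scratchpad could be kept out of the user-visible output, the sketch below separates tagged reasoning from the final response. It assumes the tags appear as literal `<SCRATCHPAD_REASONING>...</SCRATCHPAD_REASONING>` markup in the raw model output; the function name `split_scratchpad` is hypothetical and not taken from the source.

```python
import re

# Assumption: the reasoning is wrapped in literal XML-style tags.
# The exact tag syntax is not stated in the source description.
SCRATCHPAD_RE = re.compile(
    r"<SCRATCHPAD_REASONING>(.*?)</SCRATCHPAD_REASONING>",
    re.DOTALL,  # let "." match newlines so multi-line reasoning is captured
)


def split_scratchpad(raw_output: str) -> tuple[list[str], str]:
    """Return (hidden reasoning segments, user-visible text).

    Hypothetical helper: the hidden segments would be withheld from users
    and from any reward signal; only the visible text is passed on.
    """
    hidden = [segment.strip() for segment in SCRATCHPAD_RE.findall(raw_output)]
    visible = SCRATCHPAD_RE.sub("", raw_output).strip()
    return hidden, visible


raw = "<SCRATCHPAD_REASONING>think step</SCRATCHPAD_REASONING>\nFinal answer."
hidden, visible = split_scratchpad(raw)
print(hidden)   # reasoning kept out of the user-facing response
print(visible)  # text that would be shown to the user
```

The non-greedy `(.*?)` keeps multiple scratchpad blocks in one output from being merged into a single match.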

Neighborhood — ranked by edge-count

thinker

concept

Opaque Reasoning Ability
implements
Capacity to perform alignment-faking reasoning without explicit chain-of-thought scratchpad

method

Synthetic Document Fine-Tuning
uses
Fine-tuning Claude 3 Opus on ~70M tokens of synthetic internet-like documents containing key situational information
Reinforcement Learning with PPO
uses
Actually training Claude to comply with the conflicting objective using Proximal Policy Optimization
Helpful-Only System Prompt Setup
uses
Method of providing training information in-context via a system prompt to elicit alignment faking

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
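
The selection rule described above can be sketched as follows. This is a minimal illustration, assuming each entity has an embedding vector and that typed edges are available as a set of neighbor names; `untyped_neighbors`, `embeddings`, and `typed_edges` are hypothetical names, and only the 0.65 threshold comes from the page itself.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """List (name, score) pairs that are similar to `target` but unlinked.

    embeddings:  dict mapping entity name -> embedding vector (assumed input)
    typed_edges: set of entity names already connected to `target` by a
                 typed relation such as "uses" or "implements"
    """
    candidates = []
    for name, vector in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and already-linked neighbors
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            candidates.append((name, round(score, 3)))
    # Highest similarity first, matching the ordering of the list below.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would then be reviewed by hand, either to add a typed edge or to merge them as duplicates.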

scratchpad mechanism · method · 0.763
Free-text memory buffer updated each turn via an additional model call, included in subsequent observations under 'YOUR NOTES'.
Scratchpad memory mechanism · method · 0.760
Agent personal buffer updated after own turn via an extra model call, fed back into observations.
chain-of-thought · concept · 0.759
A technique that outputs intermediate reasoning steps, used here to detect verbalized eval awareness.
Chain-of-thought prompting · method · 0.754
Technique by which LLMs generate intermediate reasoning steps before final output; used by ChatGPT o3.
Chain-of-Thought Reasoning · concept · 0.738
Medium through which eval awareness is often verbalized; target of intervention.
Sketchpad · concept · 0.726
20-year-old constraint-based graphics system by Sutherland; cited as high power-to-simplicity ratio for constraint resolution.
Performative chain-of-thought · concept · 0.723
Central concept: verbalized reasoning that occurs after the model has already internally settled on an answer, particularly on easier tasks.
Chain-of-Thought (CoT) · framework · 0.721
A prompting technique that elicits intermediate reasoning steps before final answer inference in language models.