Tom Henighan (thinker:tom-henighan)
Authored papers (5)
Constitutional AI (CAI) demonstrates that a harmless, non-evasive AI assistant can be trained with zero human feedback labels for harmlessness, replacing them entirely with AI-generated feedback guided by a short list of natural language principles. The method is a two-stage pipeline. In the supervised learning (SL) phase, a helpful RLHF model iteratively critiques and revises its own responses to red-team prompts drawn from 182,831 total prompts (42,496 human-written, 140,335 model-generated). In the reinforcement learning phase, termed RLAIF, a 52B-parameter feedback model evaluates response pairs according to 16 constitutional principles and generates preference labels that train a hybrid preference model. Crowdworker Elo score comparisons across 10,274 helpfulness and 8,135 harmlessness evaluations show that RL-CAI with chain-of-thought reasoning achieves harmlessness scores that meet or exceed those of models trained with human harmlessness feedback (HH RLHF) while maintaining comparable helpfulness, a Pareto improvement over the helpfulness–harmlessness tradeoff documented in prior work. Critically, whereas HH RLHF models produced evasive refusals (e.g., 'I'm sorry, I won't respond') on sensitive PALMS and LaMDA prompts, RL-CAI models engage substantively and explain their objections. Chain-of-thought prompting on the feedback model, with CoT probabilities clamped to the 40–60% range to prevent overconfidence, further improved both harmlessness scores and label calibration. The paper argues that scaled AI supervision, which encodes alignment objectives in a transparent, auditable constitution rather than tens of thousands of opaque human labels, is a viable path toward alignment as model capabilities grow beyond reliable human oversight.
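The clamping of chain-of-thought probabilities to the 40–60% range can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's code: the function name `clamp_preference` and the representation of a label as a pair of probabilities (response A preferred, response B preferred) are hypothetical.

```python
def clamp_preference(p_a, low=0.40, high=0.60):
    """Clamp the feedback model's probability that response A is preferred.

    Hypothetical helper: returns a soft label (p_a, p_b) whose entries sum to 1.
    The 0.40 and 0.60 defaults mirror the 40-60% range quoted in the summary.
    """
    if not 0.0 <= p_a <= 1.0:
        raise ValueError("p_a must be a probability in [0, 1]")
    clamped = min(max(p_a, low), high)
    return clamped, 1.0 - clamped


# A near-certain judgment is softened; a moderate one passes through unchanged.
confident = clamp_preference(0.99)
moderate = clamp_preference(0.55)
```

Under this sketch, a feedback model that is almost always near 0 or 1 after chain-of-thought reasoning still yields soft targets for the preference model, which is the overconfidence problem the clamping is described as addressing.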
Induction heads, attention heads that search for prior occurrences of the current token and predict the token that followed, constitute the primary in-context learning mechanism in two-layer attention-only transformers. They emerge exclusively through K-composition between a first-layer previous-token head and second-layer heads, and they do not appear in one-layer models. The paper introduces the path expansion trick as its core analytical instrument: by representing transformer computation as a sum over end-to-end paths rather than a product over layers, it renders the weights of zero-, one-, and two-layer attention-only models directly interpretable. Zero-layer transformers reduce to bigram log-likelihood tables accessible via W_U W_E. One-layer models decompose into bigram plus skip-trigram ("A…BC") ensembles readable from the expanded OV and QK matrices, each with ~2.5 billion entries for a ~50,000-token vocabulary. Two-layer models introduce three composition types (Q-, K-, and V-composition), of which K-composition is empirically dominant in small models; it enables induction heads, which are verified through eigenvalue analysis of W_OV and W_QK and confirmed on out-of-distribution sequences of repeated random tokens. The models studied include configurations with 12 heads at d_head=64 and 32 heads at d_head=128, with a context size of 2048 tokens. The paper argues that induction heads mark a qualitative algorithmic transition, from statistical look-up to sequence-completion inference, that remains relevant in larger, realistic language models and provides a replicable foothold for mechanistic interpretability of transformers at scale.
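The behaviour attributed to induction heads (find an earlier occurrence of the current token, then predict the token that followed it) can be sketched as a plain lookup rule. This is a behavioural illustration only, not the attention mechanism or the paper's code; `induction_predict` and the test sequence are hypothetical names and data.

```python
import random


def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (last) token; None if there is none."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the current one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


# Out-of-distribution check in the spirit of the summary: a block of
# distinct random tokens repeated once, so bigram statistics cannot help.
rng = random.Random(0)
block = rng.sample(range(50_000), 8)
sequence = block + block

# Score predictions on the second copy, where an earlier occurrence exists.
positions = range(len(block), len(sequence) - 1)
hits = sum(
    induction_predict(sequence[: t + 1]) == sequence[t + 1] for t in positions
)
```

Because the tokens in `block` are distinct, the rule predicts every scored position in the second copy correctly, while it has nothing to copy from during the first copy.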
Affiliations (1)
- Anthropic (institute)
Co-authors (12)
- Andy Jones (3 shared)
- Catherine Olsson (3 shared)
- Christopher Olah (3 shared)
- Dario Amodei (3 shared)
- Nelson Elhage (3 shared)
- Tom Conerly (3 shared)
- Adam Jermyn (2 shared)
- Adly Templeton (2 shared)
- Amanda Askell (2 shared)
- Anna Chen (2 shared)
- Anna Goldie (2 shared)
- Azalia Mirhoseini (2 shared)
Their work is cited by (11)
- Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders (3× refs)
- Endogenous Resistance to Activation Steering in Language Models (3× refs)
- Contemplative Agent (1× refs)
- When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models (1× refs)
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? (1× refs)
- Unveiling the Latent Directions of Reflection in Large Language Models (1× refs)
- Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation (1× refs)
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior (1× refs)
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks (1× refs)
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training (1× refs)
- Towards Safe and Honest AI Agents with Neural Self-Other Overlap (1× refs)
Recent mentions (4)
- papers: mathematical.md
- papers: towards.md
- papers: scaling.md
- papers-typed: yuntao-2022-cat-s.md