Tom Henighan

Authored papers (5)

Towards monosemanticity: Decomposing language models with dictionary learning (2023)
referenced-only
CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence (2022)
Constitutional AI (CAI) demonstrates that a harmless, non-evasive AI assistant can be trained using zero human feedback labels for harmlessness, replacing them entirely with AI-generated feedback guided by a short list of natural language principles. The method introduces a two-stage pipeline. First, in a supervised learning (SL) phase, a helpful RLHF model iteratively critiques and revises its own responses to red-team prompts drawn from 182,831 total prompts (42,496 human-written, 140,335 model-generated). Second, in a reinforcement learning phase termed RLAIF, a 52B-parameter feedback model evaluates response pairs according to 16 constitutional principles and generates preference labels that train a hybrid preference model. Crowdworker Elo score comparisons across 10,274 helpfulness and 8,135 harmlessness evaluations show that RL-CAI with chain-of-thought reasoning achieves harmlessness scores that meet or exceed those of models trained with human harmlessness feedback (HH RLHF) while maintaining comparable helpfulness, a Pareto improvement over the helpfulness–harmlessness tradeoff documented in prior work. Critically, whereas HH RLHF models produced evasive refusals (e.g., 'I'm sorry, I won't respond') on sensitive PALMS and LaMDA prompts, RL-CAI models engage substantively and explain their objections. Chain-of-thought prompting on the feedback model, with CoT probabilities clamped to the 40–60% range to prevent overconfidence, further improved both harmlessness scores and label calibration. The paper argues that scaled AI supervision, which encodes alignment objectives in a transparent, auditable constitution rather than tens of thousands of opaque human labels, is a viable path toward alignment as model capabilities grow beyond reliable human oversight.
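The clamping step mentioned in the summary can be illustrated with a minimal sketch. It assumes the feedback model yields a single probability that response A is preferred over response B; the function name `clamp_preference` and its parameters are hypothetical and are not taken from the paper.

```python
def clamp_preference(p_a, low=0.4, high=0.6):
    """Clamp the probability that response A is preferred into [low, high].

    Returns the pair (p_a, p_b) used as a soft preference label.
    The 0.4 and 0.6 defaults mirror the 40-60% range in the summary;
    everything else here is an illustrative assumption.
    """
    if not 0.0 <= p_a <= 1.0:
        raise ValueError("p_a must be a probability in [0, 1]")
    clamped = min(max(p_a, low), high)
    return clamped, 1.0 - clamped


# An overconfident label is pulled back to the upper bound of the range.
print(clamp_preference(0.97))
```

Labels already inside the range pass through unchanged, so the clamp only affects near-certain judgments.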
Constitutional AI: Harmlessness from AI feedback (2022)
referenced-only
A Mathematical Framework for Transformer Circuits (2021)
Induction heads, attention heads that search for prior occurrences of the current token and predict the following token, constitute the primary in-context learning mechanism in two-layer attention-only transformers. They emerge exclusively through K-composition between a first-layer previous-token head and second-layer heads, and they do not appear in one-layer models. The paper introduces the path expansion trick as its core analytical instrument: by representing transformer computation as a sum over end-to-end paths rather than a product over layers, it renders the weights of zero-, one-, and two-layer attention-only models directly interpretable. Zero-layer transformers reduce to bigram log-likelihood tables accessible via W_U W_E. One-layer models decompose into bigram plus skip-trigram ("A…BC") ensembles readable from the ~2.5-billion-entry expanded OV and QK matrices (for a ~50,000-token vocabulary). Two-layer models introduce three composition types (Q-, K-, and V-composition), of which K-composition is empirically dominant in small models; it enables induction heads, which are verified through eigenvalue analysis of W_OV and W_QK and confirmed on out-of-distribution random repeated token sequences. Models studied use configurations including 12 heads with d_head=64 and 32 heads with d_head=128, with a context size of 2048 tokens. The paper argues that induction heads represent a qualitative algorithmic transition point, from statistical look-up to sequence-completion inference, that continues to be relevant in larger realistic language models, providing a replicable foothold for mechanistic interpretability of transformers at scale.
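The behavior attributed to induction heads in the summary (find an earlier occurrence of the current token, then predict the token that came after it) can be sketched in plain Python. This illustrates only the input-output behavior, not the attention mechanism or K-composition that implements it in a transformer; the function name `induction_predict` is hypothetical.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (last) token.

    Returns None when the current token has not appeared before.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, excluding the final one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


# On a repeated sequence, the rule continues the repetition.
print(induction_predict(["A", "B", "C", "A"]))  # prints B
```

Because the rule depends only on token identity and position, it also applies to random repeated sequences, which is the kind of out-of-distribution input the summary mentions.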
Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet
referenced-only

Affiliations (1)

Anthropic (institute)

Co-authors (12)

Andy Jones (3 shared)
Catherine Olsson (3 shared)
Christopher Olah (3 shared)
Dario Amodei (3 shared)
Nelson Elhage (3 shared)
Tom Conerly (3 shared)
Adam Jermyn (2 shared)
Adly Templeton (2 shared)
Amanda Askell (2 shared)
Anna Chen (2 shared)
Anna Goldie (2 shared)
Azalia Mirhoseini (2 shared)

Their work is cited by (11)

Recent mentions (4)

papers: mathematical.md
papers: towards.md
papers: scaling.md
papers-typed: yuntao-2022-cat-s.md