thinker:neel-nanda
Neel Nanda
External commenter; resolved apparent counterexample to linear representation hypothesis
Authored papers (2)
Contrastive activation steering can suppress evaluation-awareness and elicit genuine deployment behavior from a deliberately trained model organism, not merely silence verbalizations of being tested. Working with Llama 3.3 Nemotron Super 49B, the authors trained an evaluation-aware model using a two-step pipeline—synthetic document fine-tuning (SDF) on 115.6 million tokens followed by four rounds of expert iteration on 41,290 examples—that causes the model to write Python type hints 80–86% of the time in evaluation contexts while writing them only 33% of the time in true deployment contexts. A fictional evaluation company cue ('Wood Labs') was embedded as an adversarial test: even when user prompts explicitly claimed deployment, the Wood Labs system prompt maintained type hint rates above 80%, and simply concatenating all 16 contrastive prompt pairs as text failed to reduce that rate to deployment levels. The method introduced—contrastive activation steering extracted from the pre-fine-tuned base model using 16 deployment/evaluation prompt pairs applied at layers 10, 14, 18, 22, 26, and 30 with scaling factor 0.6—reduced type hint rates to below 11% even with the evaluation cue present, while random Gaussian steering vectors of matched norm produced no comparable suppression. Crucially, the steering vector was extracted before any fine-tuning, making it implausible that it directly encodes type-hint information; it can only work by modifying the model's contextual beliefs, a mechanism corroborated by GPT-5-graded reasoning traces showing a strong correlation between steered deployment beliefs and reduced type hint rates. The paper argues this validates activation steering as a tool for AI evaluators to uncover deployment behavior during safety-critical evaluations, including honeypot scenarios designed to detect sandbagging and alignment faking.
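The steering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses toy random vectors in place of real model activations, and it assumes the steering vector is the difference between the mean deployment activation and the mean evaluation activation over the 16 prompt pairs (the summary does not state how per-prompt activations are pooled). The layer indices and the 0.6 scaling factor come from the summary; the function names and the 32-layer toy model are hypothetical.

```python
import math
import random

# Values taken from the summary above.
STEER_LAYERS = (10, 14, 18, 22, 26, 30)
SCALE = 0.6


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def contrastive_vector(deploy_acts, eval_acts):
    """Assumed recipe: mean(deployment) minus mean(evaluation)."""
    d = mean_vector(deploy_acts)
    e = mean_vector(eval_acts)
    return [a - b for a, b in zip(d, e)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def random_matched_norm(vector, rng):
    """Gaussian control vector rescaled to the steering vector's norm."""
    g = [rng.gauss(0.0, 1.0) for _ in vector]
    s = norm(vector) / norm(g)
    return [x * s for x in g]


def steer_layers(hidden_by_layer, vector, layers=STEER_LAYERS, scale=SCALE):
    """Add scale * vector to the hidden state at the chosen layers only."""
    return {
        layer: [h + scale * v for h, v in zip(hidden, vector)]
        if layer in layers
        else list(hidden)
        for layer, hidden in hidden_by_layer.items()
    }


# Toy data: 16 prompt pairs, 8-dimensional "activations".
rng = random.Random(0)
dim = 8
eval_acts = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(16)]
# Hypothetical deployment activations: same as evaluation, shifted along axis 0.
deploy_acts = [
    [x + (1.0 if i == 0 else 0.0) for i, x in enumerate(row)] for row in eval_acts
]

vec = contrastive_vector(deploy_acts, eval_acts)
control = random_matched_norm(vec, rng)

hidden_by_layer = {layer: [0.0] * dim for layer in range(32)}
steered = steer_layers(hidden_by_layer, vec)
```

In a real model the addition would happen inside the forward pass at each selected layer; the dictionary here only stands in for that. The matched-norm control mirrors the random Gaussian baseline mentioned in the summary.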
Induction heads — attention heads that search for prior occurrences of the current token and predict the following token — constitute the primary in-context learning mechanism in two-layer attention-only transformers, and emerge exclusively through K-composition between a first-layer previous-token head and second-layer heads; they do not appear in one-layer models. The paper introduces the path expansion trick as its core analytical instrument: by representing transformer computation as a sum over end-to-end paths rather than a product over layers, it renders the weights of zero-, one-, and two-layer attention-only models directly interpretable. Zero-layer transformers reduce to bigram log-likelihood tables accessible via W_U W_E; one-layer models decompose into bigram plus skip-trigram ("A…BC") ensembles readable from the ~2.5-billion-entry expanded OV and QK matrices (for a ~50,000-token vocabulary); two-layer models introduce three composition types (Q-, K-, V-composition) of which K-composition is empirically dominant in small models, enabling induction heads verified through eigenvalue analysis of W_OV and W_QK and confirmed on out-of-distribution random repeated token sequences. Models studied use configurations including 12 heads with d_head=64 and 32 heads with d_head=128, context size 2048 tokens. The paper argues that induction heads represent a qualitative algorithmic transition point — from statistical look-up to sequence-completion inference — that continues to be relevant in larger realistic language models, providing a replicable foothold for mechanistic interpretability of transformers at scale.
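The rule attributed to induction heads above (find the most recent earlier occurrence of the current token, then predict the token that followed it) can be written as a few lines of ordinary code. This is a sketch of the algorithm only, not of the attention mechanism that implements it; the function name and the test sequence length are illustrative assumptions. The repeated random sequence mirrors the out-of-distribution check mentioned in the summary.

```python
import random


def induction_predict(tokens):
    """For each position, copy the token that followed the most recent
    earlier occurrence of the current token; None if there is no match."""
    preds = []
    last_seen = {}  # token -> index of its most recent earlier occurrence
    for i, tok in enumerate(tokens):
        j = last_seen.get(tok)
        # j < i always holds, so tokens[j + 1] is in range.
        preds.append(tokens[j + 1] if j is not None else None)
        last_seen[tok] = i
    return preds


# Random block of distinct tokens from a ~50,000-token vocabulary, repeated once.
rng = random.Random(0)
block = rng.sample(range(50_000), 20)
tokens = block + block

preds = induction_predict(tokens)

# First pass: nothing to copy from. Second pass: every next token is predicted.
first_pass = preds[: len(block)]
second_pass_hits = [
    preds[i] == tokens[i + 1] for i in range(len(block), len(tokens) - 1)
]
```

On the first copy of the block the rule has no earlier match, so it abstains; on the second copy it predicts every following token, which is the behavior a bigram or skip-trigram table cannot reproduce on random tokens.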
Affiliations (1)
- Anthropic (institute)
Co-authors (12)
- Hua, Tim Tian (2 shared)
- Marks, Samuel (2 shared)
- Nanda, Neel (2 shared)
- Qin, Andrew (2 shared)
- Andrew Qin (1 shared)
- Catherine Olsson (1 shared)
- Chris Olah (1 shared)
- Dario Amodei (1 shared)
- Nelson Elhage (1 shared)
- Samuel Marks (1 shared)
- Tim Tian Hua (1 shared)
- Tom Henighan (1 shared)
Their work is cited by (6)
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks (1× refs)
- When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models (1× refs)
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? (1× refs)
- Unveiling the Latent Directions of Reflection in Large Language Models (1× refs)
- Endogenous Resistance to Activation Steering in Language Models (1× refs)
- Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders (1× refs)
Other inbound relations (3)
Recent mentions (8)
- papers-typed/grant-2025-addressing-divergent.md
- papers-typed/tian-2025-steering-evaluation.md
- papers-typed/yu-2025-directions-cones.md
- papers-typed/denis-2025-linear-representation.md
- papers-typed/wu-2024-pyvene-library.md
- papers-typed/fu-chieh-2025-unveiling-latent.md
- papers/mathematical.md
- papers/towards.md