thinker:nelson-elhage Nelson Elhage
Authored papers (4)
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations. The claim is supported by PCA visualizations, cross-dataset probe transfer, and causal intervention experiments on LLaMA-2-7B, 13B, and 70B. PCA of residual stream activations at the most-downstream causally-implicated hidden state (group (b), e.g., layer 15 of LLaMA-2-13B over end-of-sentence punctuation) reveals clear linear separation of true and false statements across structurally and topically diverse datasets, including cities (1,496 rows), sp_en_trans (354 rows), larger_than/smaller_than (1,980 rows each), and uncurated datasets from Azaria & Mitchell and Casper et al. The paper introduces mass-mean probing (MM), an optimization-free method that computes the difference-in-means direction between true and false activations and optionally applies a covariance correction. Despite comparable classification accuracy across methods, MM outperforms logistic regression and contrast-consistent search (CCS) on causal intervention metrics in 7 of 8 experimental conditions, with normalized indirect effects as high as 0.97 (false→true) on sp_en_trans for LLaMA-2-13B with cities+neg_cities training data. Probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans for LLaMA-2-13B and 70B; this cross-topic generalization fails for LLaMA-2-7B, where representations instead cluster by surface-level token features. The paper argues that truth has a geometrically coherent, causally active linear representation in large transformers, and that interventions targeting this direction can reliably flip a model's expressed truth judgments on out-of-distribution inputs.
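The difference-in-means idea behind mass-mean probing can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy 2-D "activations", and the midpoint threshold are all assumptions made for illustration, and the optional covariance correction is omitted.

```python
# Minimal sketch of a difference-in-means ("mass-mean") probe direction.
# Assumptions: activations are plain lists of floats; the decision threshold
# is placed at the midpoint of the two class means (a choice of this sketch).

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_mass_mean_probe(true_acts, false_acts):
    """Return (theta, bias), where theta = mean(true) - mean(false)."""
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    theta = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    bias = -dot(theta, midpoint)  # hypothetical threshold, not from the paper
    return theta, bias

def predict_true(x, theta, bias):
    """Classify an activation as 'true' if it lies on the true side of theta."""
    return dot(theta, x) + bias > 0

# Toy stand-ins for residual-stream activations (made-up numbers).
true_acts = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.3]]
false_acts = [[-2.0, 0.2], [-1.9, -0.1], [-2.1, 0.0]]
theta, bias = fit_mass_mean_probe(true_acts, false_acts)
```

No optimizer is involved: the direction comes directly from the two class means, which is what "optimization-free" refers to in the summary above.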
Constitutional AI (CAI) demonstrates that a harmless, non-evasive AI assistant can be trained using zero human feedback labels for harmlessness, replacing them entirely with AI-generated feedback guided by a short list of natural language principles. The method introduces a two-stage pipeline. In the supervised learning (SL) phase, a helpful RLHF model iteratively critiques and revises its own responses to red-team prompts drawn from 182,831 total prompts (42,496 human-written, 140,335 model-generated). In the reinforcement learning phase, termed RLAIF, a 52B-parameter feedback model evaluates response pairs according to 16 constitutional principles and generates preference labels that train a hybrid preference model. Crowdworker Elo score comparisons across 10,274 helpfulness and 8,135 harmlessness evaluations show that RL-CAI with chain-of-thought reasoning achieves harmlessness scores that meet or exceed those of models trained with human harmlessness feedback (HH RLHF) while maintaining comparable helpfulness, a Pareto improvement over the helpfulness–harmlessness tradeoff documented in prior work. Critically, whereas HH RLHF models produced evasive refusals (e.g., 'I'm sorry, I won't respond') on sensitive PALMS and LaMDA prompts, RL-CAI models engage substantively and explain their objections. Chain-of-thought prompting on the feedback model, with CoT probabilities clamped to the 40–60% range to prevent overconfidence, further improved both harmlessness scores and label calibration. The paper argues that scaled AI supervision, which encodes alignment objectives in a transparent, auditable constitution rather than tens of thousands of opaque human labels, is a viable path toward alignment as model capabilities grow beyond reliable human oversight.
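The clamping step mentioned above can be sketched as follows. This is an illustrative reading only: the function name and signature are hypothetical, and the sketch assumes the feedback model emits a probability that the first response in a pair is preferred.

```python
# Sketch of clamping a feedback model's preference probability to [0.4, 0.6].
# Assumption: p_first is the probability that response A is preferred over B.

def clamp_preference(p_first, low=0.4, high=0.6):
    """Limit an overconfident preference probability to the [low, high] band."""
    if not 0.0 <= p_first <= 1.0:
        raise ValueError("p_first must be a probability in [0, 1]")
    return min(max(p_first, low), high)

# A near-certain label becomes a softer training target.
soft_label = clamp_preference(0.99)
```

Values already inside the band pass through unchanged, so only the overconfident labels are softened.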
Induction heads are attention heads that search for prior occurrences of the current token and predict the token that followed. They constitute the primary in-context learning mechanism in two-layer attention-only transformers and emerge exclusively through K-composition between a first-layer previous-token head and second-layer heads; they do not appear in one-layer models. The paper introduces the path expansion trick as its core analytical instrument: by representing transformer computation as a sum over end-to-end paths rather than a product over layers, it renders the weights of zero-, one-, and two-layer attention-only models directly interpretable. Zero-layer transformers reduce to bigram log-likelihood tables accessible via W_U W_E. One-layer models decompose into bigram plus skip-trigram ("A…BC") ensembles readable from the ~2.5-billion-entry expanded OV and QK matrices (for a ~50,000-token vocabulary). Two-layer models introduce three composition types (Q-, K-, and V-composition), of which K-composition is empirically dominant in small models; it enables induction heads, which are verified through eigenvalue analysis of W_OV and W_QK and confirmed on out-of-distribution random repeated token sequences. Models studied use configurations including 12 heads with d_head=64 and 32 heads with d_head=128, with a context size of 2048 tokens. The paper argues that induction heads represent a qualitative algorithmic transition point, from statistical look-up to sequence-completion inference, that remains relevant in larger, realistic language models and provides a replicable foothold for mechanistic interpretability of transformers at scale.
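The behavior attributed to induction heads can be written as a plain algorithm. The sketch below describes the input-output rule only, not the attention mechanism or the paper's models; the function name is hypothetical and the test sequence is made up.

```python
# Sketch of the induction rule: find the most recent earlier occurrence of
# the current token and predict the token that followed it.
import random

def induction_predict(tokens):
    """Return the predicted next token, or None if the current token is new."""
    current = tokens[-1]
    # Scan backwards over earlier positions (excluding the current one).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

# Random repeated sequence: unique token ids, then the first three repeated.
rng = random.Random(0)
first_pass = rng.sample(range(50_000), 8)
sequence = first_pass + first_pass[:3]
prediction = induction_predict(sequence)
```

Because the tokens are random, bigram or skip-trigram statistics carry no signal here, which is why repeated random sequences are a useful out-of-distribution check for this rule.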
Originates (1)
Affiliations (1)
- Anthropic (institute)
Co-authors (12)
- Catherine Olsson (3 shared)
- Dario Amodei (3 shared)
- Max Tegmark (3 shared)
- Samuel Marks (3 shared)
- Tom Henighan (3 shared)
- Amanda Askell (2 shared)
- Andy Jones (2 shared)
- Anna Chen (2 shared)
- Anna Goldie (2 shared)
- Azalia Mirhoseini (2 shared)
- Ben Mann (2 shared)
- Cameron McKinnon (2 shared)
Their work is cited by (12)
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks (2× refs)
- Towards Safe and Honest AI Agents with Neural Self-Other Overlap (1× refs)
- Contemplative Agent (1× refs)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs (1× refs)
- When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models (1× refs)
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? (1× refs)
- Unveiling the Latent Directions of Reflection in Large Language Models (1× refs)
- Endogenous Resistance to Activation Steering in Language Models (1× refs)
- Testing the Limits of Truth Directions in LLMs (1× refs)
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior (1× refs)
- Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders (1× refs)
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts (1× refs)
Other inbound relations (3)
Recent mentions (7)
- papers-typed marks-2023-geometry-truth.md
- papers-typed marks-2023-geometry-truth.md
- papers-typed mckenzie-2026-endogenous-resistance.md
- papers-typed fu-chieh-2025-unveiling-latent.md
- papers-typed blas-2026-psychological.md
- papers mathematical.md
- papers-typed yuntao-2022-cat-s.md