Steering language models with activation engineering

By A. M. Turner · L. Thiergart · G. Leech · D. Udell · J. J. Vazquez · U. Mini +1 more

Original abstract

Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
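The mechanism in the abstract (contrast activations on a prompt pair, then add the scaled difference during the forward pass) can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy model and the names `embed`, `layer_fn`, `forward`, and `steering_vector`, the injection layer, the coefficient, and the choice to add the vector at the front positions of the prompt are all hypothetical choices made for the example.

```python
import math
import random

D_MODEL = 4   # toy hidden size (assumption, for illustration only)
N_LAYERS = 3  # toy depth (assumption, for illustration only)


def embed(token):
    """Deterministic toy embedding: seed a PRNG with the token string."""
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]


def layer_fn(x, layer):
    """Toy residual update applied independently at each position (no attention)."""
    d = len(x)
    return [x[i] + 0.1 * math.tanh((layer + 1) * x[(i + 1) % d]) for i in range(d)]


def forward(tokens, steer=None, steer_layer=None, capture_layer=None):
    """Run the toy model; optionally record or modify the input to one layer.

    Returns (final_activations, captured_activations), each a list with one
    vector per token position. captured_activations is None if not requested.
    """
    acts = [embed(t) for t in tokens]
    captured = None
    for layer in range(N_LAYERS):
        if layer == capture_layer:
            captured = [list(v) for v in acts]
        if steer is not None and layer == steer_layer:
            # Add the steering vector to the first positions of the prompt.
            for pos, delta in enumerate(steer[:len(acts)]):
                acts[pos] = [a + s for a, s in zip(acts[pos], delta)]
        acts = [layer_fn(v, layer) for v in acts]
    return acts, captured


def steering_vector(pos_tokens, neg_tokens, layer, coeff):
    """Scaled difference of activations at `layer` for a contrast prompt pair."""
    assert len(pos_tokens) == len(neg_tokens), "pad the pair to equal length"
    _, a_pos = forward(pos_tokens, capture_layer=layer)
    _, a_neg = forward(neg_tokens, capture_layer=layer)
    return [[coeff * (p - n) for p, n in zip(vp, vn)]
            for vp, vn in zip(a_pos, a_neg)]


# One contrast pair, no optimization: compute the vector, then steer.
prompt = ["I", "think", "you", "are"]
steer = steering_vector(["Love"], ["Hate"], layer=1, coeff=5.0)
plain, _ = forward(prompt)
steered, _ = forward(prompt, steer=steer, steer_layer=1)
```

In this toy, positions do not interact, so only the first position of `steered` differs from `plain`. In a real transformer, attention would carry the change to later positions, and the intervention would typically be attached to the model's residual stream with a hook rather than written into the forward loop.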

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

Improving Instruction-Following in Language Models through Activation Steering
Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi, Alessandro Stolfo
2025
≈ 80%
Improving Activation Steering in Language Models with Mean-Centring
Dylan Cope, Nandi Schoots, Murray Shanahan, Ole Jorgensen
2023
≈ 79%
Contextual Linear Activation Steering of Language Models
Daniel Beaglehole, Adityanarayanan Radhakrishnan, Mikhail Belkin, Brandon Hsu
2026
≈ 78%
Steering When Necessary: Flexible Steering Large Language Models with Backtracking
Jinwei Gan, Zhiwei Jiang, Cong Wang, Yafeng Yin, Xiang Luo, Yuchen Fu, Qing Gu, Zifeng Cheng
2025
≈ 78%
Minimizing Collateral Damage in Activation Steering
Tu Anh Nguyen, Sina Alemohammad, Richard G. Baraniuk, Tam Nguyen
2026
≈ 77%
Improving Steering Vectors by Targeting Sparse Autoencoder Features
Sviatoslav Chalnev, Matthew Siu, Arthur Conmy
2024
≈ 77%
Interpretable Steering of Large Language Models with Feature Guided Activation Additions
Chen Guang, Wesley Teng, Chandrasekaran Balaganesh, Tan Guoxian, Yan Ming, Samuel Soo
2025
≈ 76%
Extending Activation Steering to Broad Skills and Multiple Behaviours
Massimo Poesio, Nandi Schoots, Teun van der Weij
2024
≈ 76%
Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models
Zhiyuan He, Pin-Yu Chen, Ching-Yun Ko, Tsung-Yi Ho, Chen Xiong
2026
≈ 76%
Steer Like the LLM: Activation Steering that Mimics Prompting
Geert Heyman, Frederik Vandeputte
2026
≈ 76%
What Can We Actually Steer? A Multi-Behavior Study of Activation Control
Krystian Novak, Tetiana Bas
2026
≈ 76%
Global Evolutionary Steering: Refining Activation Steering Control via Cross-Layer Consistency
Wenjing Yu, Di Wang, Lijie Hu, Xinyan Jiang
2026
≈ 76%
Steering Language Model Refusal with Sparse Autoencoders
David Majercak, Xavier Fernandes, Richard Edgar, Blake Bullwinkel, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangdeh, Kyle O'Brien
2025
≈ 75%
HyperSteer: Activation Steering at Scale with Hypernetworks
Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, Atticus Geiger, Jiuding Sun
2025
≈ 75%
ExpertLens: Activation steering features are highly interpretable
Eleonora Gualdoni, Sinead Williamson, Katherine Metcalf, Skyler Seto, Barry-John Theobald, Masha Fedzechkina
2025
≈ 75%
Steering Evaluation-Aware Language Models to Act Like They Are Deployed
in corpus
2025
≈ 72%
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
in corpus
2026
≈ 69%
Interpreting Language Model Parameters
in corpus
2026
≈ 69%
Unveiling the Latent Directions of Reflection in Large Language Models
in corpus
2025
≈ 69%
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
in corpus
2026
≈ 68%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 68%
Steering Along Manifolds to Control Neural Networks
in corpus
≈ 67%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 67%
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
in corpus
2026
≈ 66%
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
in corpus
2024
≈ 66%
Verbalized Eval Awareness Inflates Measured Safety
in corpus
2026
≈ 65%
Psychological Steering of Large Language Models
in corpus
2026
≈ 65%
Model Alignment Search
in corpus
2025
≈ 65%

Similar preprints: Semantic Scholar

Cited by (7)

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of or
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
Strategic deception in chain-of-thought (CoT) reasoning models is measurable, inducible, and controllable via representation engineering—a finding with direct implications for AI alignment. Applied to
Endogenous Resistance to Activation Steering in Language Models
Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
Quantitative introspection—the causal coupling between an instruction-tuned LLM's numeric self-report and a probe-defined internal emotive direction—is demonstrably present in models as small as LLaMA
Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
Binary introspection paradigms in LLMs are wholly invalidated by a methodological confound: when concept vectors are injected into Meta-Llama-3.1-8B-Instruct via activation steering, the correlation b
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
Probe-based data attribution, introduced here as a method for surfacing and mitigating undesirable post-training behaviors, reduces harmful compliance in OLMo 2 7B by 63% through datapoint filtering a