paper:arxiv-2308-10248
Steering language models with activation engineering
Original abstract
Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
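The mechanism the abstract describes (contrast the activations of a prompt pair at one layer, then add the scaled difference back in during the forward pass) can be sketched on a toy model. This is a minimal illustration, not the authors' implementation: the tiny residual network, the `embed` stand-in for a tokenizer and embedding, and the names `compute_steering_vector`, `LAYER`, and `COEFF` are all assumptions made for the example. A real setup would hook the residual stream of a transformer and work per token position.

```python
import math
import random

DIM = 8        # width of the toy residual stream (assumption)
N_LAYERS = 4   # number of toy layers (assumption)

def make_toy_model(seed=0):
    """Build random weights: one DIM x DIM matrix per layer (toy stand-in for an LM)."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.3, 0.3) for _ in range(DIM)] for _ in range(DIM)]
            for _ in range(N_LAYERS)]

def embed(prompt):
    """Hypothetical embedding: a deterministic pseudo-random vector per prompt string."""
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def apply_layer(weights, h):
    """Residual update h + tanh(W h)."""
    return [h_i + math.tanh(sum(w * x for w, x in zip(row, h)))
            for h_i, row in zip(h, weights)]

def forward(model, prompt, steer_layer=None, steering_vector=None,
            coeff=1.0, stop_at=None):
    """Run the toy model, optionally adding coeff * steering_vector at the
    input of `steer_layer`, and optionally returning the activation that
    enters layer `stop_at` instead of the final output."""
    h = embed(prompt)
    for i, weights in enumerate(model):
        if steering_vector is not None and i == steer_layer:
            h = [x + coeff * s for x, s in zip(h, steering_vector)]
        if i == stop_at:
            return h
        h = apply_layer(weights, h)
    return h

def compute_steering_vector(model, positive, negative, layer):
    """Difference of the two prompts' activations at the input of `layer`."""
    h_pos = forward(model, positive, stop_at=layer)
    h_neg = forward(model, negative, stop_at=layer)
    return [p - n for p, n in zip(h_pos, h_neg)]

model = make_toy_model()
LAYER, COEFF = 2, 4.0  # injection layer and scale, chosen arbitrarily here
vec = compute_steering_vector(model, "Love", "Hate", LAYER)

plain = forward(model, "I think dogs are")
steered = forward(model, "I think dogs are",
                  steer_layer=LAYER, steering_vector=vec, coeff=COEFF)
```

No gradient step or training loop appears anywhere in the sketch: the steering vector comes from two forward passes on a single prompt pair, which is the property the abstract calls lightweight. The layer index and coefficient are the main knobs one would iterate over.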
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- Improving Instruction-Following in Language Models through Activation Steering. Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi, Alessandro Stolfo (2025). ≈ 80%
- Improving Activation Steering in Language Models with Mean-Centring. Dylan Cope, Nandi Schoots, Murray Shanahan, Ole Jorgensen (2023). ≈ 79%
- Contextual Linear Activation Steering of Language Models. Daniel Beaglehole, Adityanarayanan Radhakrishnan, Mikhail Belkin, Brandon Hsu (2026). ≈ 78%
- Steering When Necessary: Flexible Steering Large Language Models with Backtracking. Jinwei Gan, Zhiwei Jiang, Cong Wang, Yafeng Yin, Xiang Luo, Yuchen Fu, Qing Gu, Zifeng Cheng (2025). ≈ 78%
- Minimizing Collateral Damage in Activation Steering. Tu Anh Nguyen, Sina Alemohammad, Richard G. Baraniuk, Tam Nguyen (2026). ≈ 77%
- Improving Steering Vectors by Targeting Sparse Autoencoder Features. Sviatoslav Chalnev, Matthew Siu, Arthur Conmy (2024). ≈ 77%
- Interpretable Steering of Large Language Models with Feature Guided Activation Additions. Chen Guang, Wesley Teng, Chandrasekaran Balaganesh, Tan Guoxian, Yan Ming, Samuel Soo (2025). ≈ 76%
- Extending Activation Steering to Broad Skills and Multiple Behaviours. Massimo Poesio, Nandi Schoots, Teun van der Weij (2024). ≈ 76%
- Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models. Zhiyuan He, Pin-Yu Chen, Ching-Yun Ko, Tsung-Yi Ho, Chen Xiong (2026). ≈ 76%
- Steer Like the LLM: Activation Steering that Mimics Prompting. Geert Heyman, Frederik Vandeputte (2026). ≈ 76%
- What Can We Actually Steer? A Multi-Behavior Study of Activation Control. Krystian Novak, Tetiana Bas (2026). ≈ 76%
- Global Evolutionary Steering: Refining Activation Steering Control via Cross-Layer Consistency. Wenjing Yu, Di Wang, Lijie Hu, Xinyan Jiang (2026). ≈ 76%
- Steering Language Model Refusal with Sparse Autoencoders. David Majercak, Xavier Fernandes, Richard Edgar, Blake Bullwinkel, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangdeh, Kyle O'Brien (2025). ≈ 75%
- HyperSteer: Activation Steering at Scale with Hypernetworks. Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, Atticus Geiger, Jiuding Sun (2025). ≈ 75%
- ExpertLens: Activation steering features are highly interpretable. Eleonora Gualdoni, Sinead Williamson, Katherine Metcalf, Skyler Seto, Barry-John Theobald, Masha Fedzechkina (2025). ≈ 75%
- Interpreting Language Model Parameters [in corpus] (2026). ≈ 69%
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior [in corpus] (2026). ≈ 68%
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders [in corpus] (2026). ≈ 68%
- Verbalized Eval Awareness Inflates Measured Safety [in corpus] (2026). ≈ 65%
- Psychological Steering of Large Language Models [in corpus] (2026). ≈ 65%
- Model Alignment Search [in corpus] (2025). ≈ 65%
Cited by (7)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of or
- When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
Strategic deception in chain-of-thought (CoT) reasoning models is measurable, inducible, and controllable via representation engineering—a finding with direct implications for AI alignment. Applied to
- Endogenous Resistance to Activation Steering in Language Models
- Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
Quantitative introspection—the causal coupling between an instruction-tuned LLM's numeric self-report and a probe-defined internal emotive direction—is demonstrably present in models as small as LLaMA
- Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
Binary introspection paradigms in LLMs are wholly invalidated by a methodological confound: when concept vectors are injected into Meta-Llama-3.1-8B-Instruct via activation steering, the correlation b
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
Probe-based data attribution, introduced here as a method for surfacing and mitigating undesirable post-training behaviors, reduces harmful compliance in OLMo 2 7B by 63% through datapoint filtering a