Endogenous Resistance to Activation Steering in Language Models

By Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab · Murat Cubuktepe · Mike Vaiana +3 more
AE Studio; Princeton Neuroscience Institute & Department of Psychology, Princeton University

DOI 10.48550/arxiv.2602.06941 · arXiv 2602.06941 · OpenAlex W7128405121

Endogenous Steering Resistance · Activation Steering · 146 Self-Correction Episodes from Llama-3.3-70B · Gemma-2-2B-it · ESR Testing Pipeline · 38 Object-Level Explain-How Prompts · Gemma-2-9B-it · Meta-Prompting for ESR Enhancement · Gemma-2-27B-it · Internal Consistency Monitoring · Off-Topic Detector Latent Ablation · Synthetic Self-Correction Training Examples · Llama-3.1-8B-Instruct · sparse autoencoders · +4 more

Methods (6)

Activation Steering
Causal intervention technique: add a steering vector (here, built from SAE latents) to the model's activations during inference to manipulate its behavior.
ESR Testing Pipeline
Three-step protocol: (1) object-level prompting, (2) SAE-latent steering, (3) judge model scoring of attempts
Meta-Prompting for ESR Enhancement
Appending instructional meta-prompts to object-level prompts to deliberately enhance ESR in models
Off-Topic Detector Latent Ablation
Causal intervention clamping 26 identified OTD latents to zero during steered inference to test ESR contribution
sparse autoencoders
Existing interpretability method that decodes model activations rather than the parameters themselves; noted as an incomplete solution.
Synthetic Self-Correction Fine-Tuning
Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior
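
The steering and ablation interventions listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy SAE weights, the sizes, the latent indices, and the function names (`sae_encode`, `steer`, `zero_ablate`) are all hypothetical, and subtracting the selected latents' decoded contribution is only one common way to realize "clamping to zero".

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_LATENTS = 16, 64  # toy sizes, not those of any real model

# Hypothetical toy SAE weights; a real SAE would be trained on model activations.
W_enc = rng.normal(size=(D_MODEL, N_LATENTS)) / np.sqrt(D_MODEL)
W_dec = rng.normal(size=(N_LATENTS, D_MODEL))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder rows
b_enc = np.zeros(N_LATENTS)


def sae_encode(resid):
    """ReLU encoder: residual activations -> non-negative latent activations."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)


def steer(resid, latent_idx, boost):
    """Add `boost` times one latent's decoder direction at every position."""
    return resid + boost * W_dec[latent_idx]


def zero_ablate(resid, latent_ids):
    """Subtract the selected latents' decoded contribution from the residual.

    Only those latents' contribution is removed, so the SAE reconstruction
    error and every other latent's contribution stay untouched.
    """
    acts = sae_encode(resid)
    contribution = acts[..., latent_ids] @ W_dec[latent_ids]
    return resid - contribution


# Toy usage: 5 token positions, steer with latent 3, ablate three made-up ids.
resid = rng.normal(size=(5, D_MODEL))
steered = steer(resid, latent_idx=3, boost=4.0)
ablated = zero_ablate(steered, latent_ids=[7, 11, 42])
```

In a real experiment both functions would run inside a forward hook at the chosen layer, with the ablation applied to the full set of identified latents while steering remains active.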

Datasets (4)

146 Self-Correction Episodes from Llama-3.3-70B
Dataset of confirmed self-correction episodes used for sequential activation analysis
38 Object-Level Explain-How Prompts
Curated set of 38 instructional prompts used as evaluation stimuli across all experiments
Gemma-2-27B-it
27B parameter LLM used in SOO fine-tuning experiments
Synthetic Self-Correction Training Examples
Claude 4.5 Sonnet-generated training data pairing prompts with off-topic starts, corrections, and correct answers
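
To make the idea of loss masking on such examples concrete, here is a small pure-Python sketch. Which spans were actually masked is not stated here; the choice below (no loss on the prompt or the off-topic start, loss on the correction and the answer) is an assumption, as are the token ids and the helper names `build_labels` and `masked_mean_loss`. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(segments):
    """Concatenate (token_ids, train_on) segments into input ids and labels.

    Tokens in segments with train_on=False get IGNORE_INDEX as their label,
    so they contribute nothing to the loss.
    """
    input_ids, labels = [], []
    for token_ids, train_on in segments:
        input_ids.extend(token_ids)
        labels.extend(token_ids if train_on else [IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


def masked_mean_loss(per_token_losses, labels):
    """Average per-token losses over the positions that are not masked."""
    kept = [loss for loss, lab in zip(per_token_losses, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0


# Toy example with made-up token ids.
segments = [
    ([1, 2, 3], False),    # prompt (assumed masked)
    ([10, 11], False),     # off-topic start (assumed masked)
    ([20, 21], True),      # correction
    ([30, 31, 32], True),  # correct answer
]
input_ids, labels = build_labels(segments)
loss = masked_mean_loss([1.0] * 5 + [2.0, 4.0, 2.0, 4.0, 3.0], labels)
```

Masking the off-topic start would keep the model from being trained to produce off-topic text while still conditioning on it; the sketch makes that a per-segment flag so either choice can be expressed.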

Findings (22)

Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio
Shows behavioral pattern of self-correction is trainable in smaller models
Approximately half of the 26 OTD latents show near-zero or negative effect sizes, activating more during on-topic content
Reveals that contrastive search yields a heterogeneous set, not all functioning as true off-topic detectors
Backtracking latents remain low during off-topic content and peak shortly after self-correction begins in Llama-3.3-70B
Complementary temporal activation pattern suggesting distinct roles for OTD and backtracking latent classes
Llama-3.3-70B corrected response scores 75/100 rather than 100 due to residual steering effects (Snell's law reference)
Illustrative finding that ESR mitigates but does not fully eliminate steering influence
Ali et al. 2025 found contrastive activation addition less effective at larger model scale, consistent with ESR in 70B
Prior finding from related work that aligns with ESR being strongest in the largest model tested
All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models
Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
ESR exhibits a non-monotonic relationship with boost level, peaking around -0.3σ (slightly below threshold) in Llama-3.3-70B
Characterizes the narrow operating window in which ESR can manifest
OTD latent ablation leaves mean first-attempt score unchanged (baseline 26.3, ablation 27.4) in Llama-3.3-70B
Evidence that OTDs specifically support meta-cognitive monitoring rather than general response generation
OTD latent activation begins declining before verbal self-correction appears in the output in Llama-3.3-70B
Temporal pattern consistent with internal monitoring process preceding explicit self-correction
OTD latents fire 4.4× higher during off-topic content compared to baseline episodes without self-correction
Quantitative characterization of OTD activation differential establishing their off-topic monitoring role
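
Several findings above are stated in terms of multi-attempt rate, first-attempt score, and improvement. The sketch below shows one plausible way to compute such metrics from judge scores. The exact definitions are assumptions rather than the paper's, the function names are hypothetical, and the scores are made-up toy data.

```python
def multi_attempt_rate(episodes):
    """Fraction of episodes in which the model made more than one attempt."""
    return sum(len(scores) > 1 for scores in episodes) / len(episodes)


def mean_first_attempt_score(episodes):
    """Mean judge score (0-100) of the first attempt in each episode."""
    return sum(scores[0] for scores in episodes) / len(episodes)


def improvement_rate(episodes):
    """Among multi-attempt episodes, fraction whose last attempt beats the first."""
    multi = [scores for scores in episodes if len(scores) > 1]
    if not multi:
        return 0.0
    return sum(scores[-1] > scores[0] for scores in multi) / len(multi)


# Each episode is the list of judge scores for its successive attempts.
episodes = [[20], [30, 75], [10, 5], [40]]
mar = multi_attempt_rate(episodes)
first = mean_first_attempt_score(episodes)
improved = improvement_rate(episodes)
```

Keeping the attempt rate separate from the improvement rate matters for interpretation: a model can attempt corrections more often without those corrections scoring any better.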

Claims (12)

Off-topic detector is a functional label based on selection methodology; these latents may serve broader coherence-monitoring roles beyond detecting off-topic content
Epistemic caution about over-interpreting the OTD label given the heterogeneity of identified latents
ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions relying on activation steering
Core policy-relevant implication of the paper for AI safety
ESR parallels endogenous attention control in biological systems where top-down mechanisms detect distracting inputs and redirect processing
Cross-domain analogy linking ESR to Attention Schema Theory
The 25% reduction in multi-attempt rate from OTD ablation suggests additional mechanisms contribute to ESR beyond the identified latents
Acknowledges incompleteness of the causal account, suggesting redundant circuits or nonlinear interactions
The meta-prompting scaling pattern suggests underlying self-monitoring circuits must already be present for prompting to enhance them
Mechanistic interpretation of why meta-prompting effects scale with model size
ESR differs from the Hydra Effect in that ESR involves active, online detection and correction with explicit self-interruption tokens
Distinguishes ESR from prior work on model self-repair
Fine-tuning induces the behavioral pattern of self-correction but does not improve the underlying ability to correct effectively
Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation
Extension of mechanistic interpretability findings to the metacognitive domain
The 26 differentially-activated OTD latents play a causally important role in enabling ESR in Llama-3.3-70B
Causal interpretation of the ablation experiment results
We cannot isolate whether ESR reflects scale, architecture, or training procedures in Llama-3.3-70B
Epistemic limitation claim acknowledging confounds in the cross-model comparison

Hypotheses (3)

We hypothesize earlier-layer interventions allow more downstream computation to process and potentially correct the perturbation
Post-hoc explanation for why steering at layer 33 rather than layer 50 produced better ESR behavior in Llama-3.3-70B
We hypothesize ESR may emerge from RLHF training rather than existing in pretrained representations
Open question about the developmental origin of ESR mechanisms
We hypothesize ESR might be adversarially circumvented through targeted interventions
Open safety-relevant question about whether ESR can be bypassed

Questions (6)

How does ESR respond to safety-relevant steering interventions, e.g. toward harmful content?
Key open question for AI safety implications of ESR
What is the full computational pathway underlying self-correction across multiple layers?
Mechanistic question requiring multi-layer SAE analysis beyond current single-layer approach
Does ESR emerge from RLHF or does it exist in pretrained representations?
Open question about developmental origin of ESR mechanisms
Does ESR reflect model scale, architecture, or training procedures?
Central unresolved question about the mechanism behind ESR's apparent size-dependence
Do large language models monitor their own internal states?
Framing question that motivates the entire paper
Can ESR be adversarially circumvented?
Open security question about robustness of ESR-based defenses

Original abstract

Large language models can resist task-misaligned activation steering during inference, sometimes recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B shows substantial ESR, while smaller models from the Llama-3 and Gemma-2 families exhibit the phenomenon less frequently. We identify 26 SAE latents that activate differentially during off-topic content and are causally linked to ESR in Llama-3.3-70B. Zero-ablating these latents reduces the multi-attempt rate by 25%, providing causal evidence for dedicated internal consistency-checking circuits. We demonstrate that ESR can be deliberately enhanced through both prompting and training: meta-prompts instructing the model to self-monitor increase the multi-attempt rate by 4x for Llama-3.3-70B, and fine-tuning on self-correction examples successfully induces ESR-like behavior in smaller models. These findings have dual implications: ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions that rely on activation steering. Understanding and controlling these resistance mechanisms is important for developing transparent and controllable AI systems. Code is available at github.com/agencyenterprise/endogenous-steering-resistance.
