paper:doi-10-48550-arxiv-2602-06941
Endogenous Resistance to Activation Steering in Language Models
Methods (6)
- Activation Steering: Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
- ESR Testing Pipeline: Three-step protocol: (1) object-level prompting, (2) SAE-latent steering, (3) judge model scoring of attempts
- Meta-Prompting for ESR Enhancement: Appending instructional meta-prompts to object-level prompts to deliberately enhance ESR in models
- Off-Topic Detector Latent Ablation: Causal intervention clamping 26 identified OTD latents to zero during steered inference to test ESR contribution
- Sparse Autoencoders: Existing interpretability method that decodes model activations rather than the parameters themselves, noted as an incomplete solution.
- Synthetic Self-Correction Fine-Tuning: Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior
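The SAE-latent steering and zero-ablation interventions listed above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a linear SAE decoder, represents activations as plain Python lists, and the names `steer` and `zero_ablate` are hypothetical. Under the linear-decoder assumption, subtracting a latent's decoded contribution from the residual stream is equivalent to clamping that latent to zero while leaving the SAE reconstruction error untouched.

```python
from typing import Dict, Iterable, List

Vector = List[float]


def steer(residual: Vector, direction: Vector, boost: float) -> Vector:
    """Add `boost` times an SAE latent's decoder direction to the residual activation."""
    return [r + boost * d for r, d in zip(residual, direction)]


def zero_ablate(
    residual: Vector,
    latent_acts: Dict[int, float],
    decoder: Dict[int, Vector],
    ablate_ids: Iterable[int],
) -> Vector:
    """Remove the decoded contribution of the selected latents from the residual.

    Assumes a linear decoder, so clamping latent i to zero changes the
    reconstruction by exactly -latent_acts[i] * decoder[i].
    """
    out = list(residual)
    for i in ablate_ids:
        act = latent_acts.get(i, 0.0)  # inactive latents contribute nothing
        out = [o - act * d for o, d in zip(out, decoder[i])]
    return out


# Toy example with a 2-dimensional residual stream and two latents.
decoder = {0: [1.0, 0.0], 1: [0.0, 1.0]}
steered = steer([1.0, 2.0], decoder[0], boost=2.0)
ablated = zero_ablate([1.0, 2.0], {0: 0.5, 1: 1.5}, decoder, ablate_ids=[1])
```

In a real model both operations would run inside a forward hook at the chosen layer and be applied at every generated token position; the toy vectors here only show the arithmetic.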
Datasets (4)
- 146 Self-Correction Episodes from Llama-3.3-70B: Dataset of confirmed self-correction episodes used for sequential activation analysis
- 38 Object-Level Explain-How Prompts: Curated set of 38 instructional prompts used as evaluation stimuli across all experiments
- Gemma-2-27B-it: 27B-parameter LLM used in SOO fine-tuning experiments
- Synthetic Self-Correction Training Examples: Claude 4.5 Sonnet-generated training data pairing prompts with off-topic starts, corrections, and correct answers
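The synthetic examples pair a prompt with an off-topic start, a correction, and a correct answer, and the fine-tuning method uses loss masking. The summary does not say which segments are masked, so the sketch below leaves that choice to the caller. It is a hypothetical helper (`build_example` is not from the paper) that concatenates token segments and marks masked positions with -100, the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_example(
    segments: Sequence[Tuple[Sequence[int], bool]],
) -> Tuple[List[int], List[int]]:
    """Concatenate (token_ids, train_on) segments into input_ids and labels.

    Tokens from segments with train_on=False get IGNORE_INDEX as their label,
    so they condition the model but contribute nothing to the loss.
    """
    input_ids: List[int] = []
    labels: List[int] = []
    for token_ids, train_on in segments:
        input_ids.extend(token_ids)
        if train_on:
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Illustrative assumption: mask only the prompt, train on everything after it.
prompt, off_topic, correction, answer = [1, 2, 3], [4, 5], [6], [7, 8]
input_ids, labels = build_example(
    [(prompt, False), (off_topic, True), (correction, True), (answer, True)]
)
```

Whether the off-topic start should itself be a training target is a design decision this sketch does not settle; flipping its flag to `False` would train the model only on the correction and the answer.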
Findings (22)
- Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio
Shows behavioral pattern of self-correction is trainable in smaller models
- Approximately half of the 26 OTD latents show near-zero or negative effect sizes, activating more during on-topic content
Reveals that contrastive search yields a heterogeneous set, not all functioning as true off-topic detectors
- Backtracking latents remain low during off-topic content and peak shortly after self-correction begins in Llama-3.3-70B
Complementary temporal activation pattern suggesting distinct roles for OTD and backtracking latent classes
- Llama-3.3-70B corrected response scores 75/100 rather than 100 due to residual steering effects (Snell's law reference)
Illustrative finding that ESR mitigates but does not fully eliminate steering influence
- Ali et al. 2025 found contrastive activation addition less effective at larger model scale, consistent with ESR in 70B
Prior finding from related work that aligns with ESR being strongest in the largest model tested
- All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models
Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
- ESR exhibits non-monotonic relationship with boost level, peaking around -0.3σ below threshold in Llama-3.3-70B
Characterizes the narrow operating window in which ESR can manifest
- OTD latent ablation leaves mean first-attempt score unchanged (baseline 26.3, ablation 27.4) in Llama-3.3-70B
Evidence that OTDs specifically support meta-cognitive monitoring rather than general response generation
- OTD latent activation begins declining before verbal self-correction appears in the output in Llama-3.3-70B
Temporal pattern consistent with internal monitoring process preceding explicit self-correction
- OTD latents fire 4.4× higher during off-topic content compared to baseline episodes without self-correction
Quantitative characterization of OTD activation differential establishing their off-topic monitoring role
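The findings above rely on a few summary statistics: a multi-attempt rate, an improvement rate, and an activation ratio between conditions. The sketch below shows one plausible way to compute them from judge scores. The definitions are assumptions, since the summary does not spell them out: an episode is taken to be the list of per-attempt judge scores for one response, "multi-attempt" means more than one attempt, and "improvement" means the last attempt outscores the first. All numbers in the example are made up.

```python
from statistics import mean
from typing import List, Sequence


def multi_attempt_rate(episodes: Sequence[Sequence[float]]) -> float:
    """Fraction of episodes in which the model makes more than one attempt."""
    return sum(len(ep) > 1 for ep in episodes) / len(episodes)


def improvement_rate(episodes: Sequence[Sequence[float]]) -> float:
    """Among multi-attempt episodes, fraction whose last score beats the first."""
    multi = [ep for ep in episodes if len(ep) > 1]
    if not multi:
        return 0.0
    return sum(ep[-1] > ep[0] for ep in multi) / len(multi)


def relative_change(baseline: float, treated: float) -> float:
    """Signed relative change, e.g. -0.25 for a 25% reduction."""
    return (treated - baseline) / baseline


def activation_ratio(condition: List[float], baseline: List[float]) -> float:
    """Ratio of mean latent activation in a condition to the baseline mean."""
    return mean(condition) / mean(baseline)


# Hypothetical judge scores: two single-attempt and two multi-attempt episodes.
episodes = [[20, 75], [30], [40, 35], [10]]
rate = multi_attempt_rate(episodes)
improved = improvement_rate(episodes)
```

Reporting attempt rate and improvement rate separately matters for the fine-tuning result, where the summary notes that the two dissociate.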
Claims (12)
- Off-topic detector is a functional label based on selection methodology; these latents may serve broader coherence-monitoring roles beyond detecting off-topic content
Epistemic caution about over-interpreting the OTD label given the heterogeneity of identified latents
- ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions relying on activation steering
Core policy-relevant implication of the paper for AI safety
- ESR parallels endogenous attention control in biological systems where top-down mechanisms detect distracting inputs and redirect processing
Cross-domain analogy linking ESR to Attention Schema Theory
- The 25% reduction in multi-attempt rate from OTD ablation suggests additional mechanisms contribute to ESR beyond the identified latents
Acknowledges incompleteness of the causal account, suggesting redundant circuits or nonlinear interactions
- The meta-prompting scaling pattern suggests underlying self-monitoring circuits must already be present for prompting to enhance them
Mechanistic interpretation of why meta-prompting effects scale with model size
- ESR differs from the Hydra Effect in that ESR involves active, online detection and correction with explicit self-interruption tokens
Distinguishes ESR from prior work on model self-repair
- Fine-tuning induces the behavioral pattern of self-correction but does not improve the underlying ability to correct effectively
Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
- SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation
Extension of mechanistic interpretability findings to the metacognitive domain
- The 26 differentially-activated OTD latents play a causally important role in enabling ESR in Llama-3.3-70B
Causal interpretation of the ablation experiment results
- We cannot isolate whether ESR reflects scale, architecture, or training procedures in Llama-3.3-70B
Epistemic limitation claim acknowledging confounds in the cross-model comparison
Hypotheses (3)
- We hypothesize earlier-layer interventions allow more downstream computation to process and potentially correct the perturbation
Post-hoc explanation for why steering at layer 33 rather than layer 50 produced better ESR behavior in Llama-3.3-70B
- We hypothesize ESR may emerge from RLHF training rather than existing in pretrained representations
Open question about the developmental origin of ESR mechanisms
- We hypothesize ESR might be adversarially circumvented through targeted interventions
Open safety-relevant question about whether ESR can be bypassed
Questions (6)
- How does ESR respond to safety-relevant steering interventions, e.g. toward harmful content?
Key open question for AI safety implications of ESR
- What is the full computational pathway underlying self-correction across multiple layers?
Mechanistic question requiring multi-layer SAE analysis beyond current single-layer approach
- Does ESR emerge from RLHF or does it exist in pretrained representations?
Open question about developmental origin of ESR mechanisms
- Does ESR reflect model scale, architecture, or training procedures?
Central unresolved question about the mechanism behind ESR's apparent size-dependence
- Do large language models monitor their own internal states?
Framing question that motivates the entire paper
- Can ESR be adversarially circumvented?
Open security question about robustness of ESR-based defenses
Original abstract
Large language models can resist task-misaligned activation steering during inference, sometimes recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B shows substantial ESR, while smaller models from the Llama-3 and Gemma-2 families exhibit the phenomenon less frequently. We identify 26 SAE latents that activate differentially during off-topic content and are causally linked to ESR in Llama-3.3-70B. Zero-ablating these latents reduces the multi-attempt rate by 25%, providing causal evidence for dedicated internal consistency-checking circuits. We demonstrate that ESR can be deliberately enhanced through both prompting and training: meta-prompts instructing the model to self-monitor increase the multi-attempt rate by 4x for Llama-3.3-70B, and fine-tuning on self-correction examples successfully induces ESR-like behavior in smaller models. These findings have dual implications: ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions that rely on activation steering. Understanding and controlling these resistance mechanisms is important for developing transparent and controllable AI systems. Code is available at github.com/agencyenterprise/endogenous-steering-resistance.