paper
active
2026
paper:doi-10-48550-arxiv-2602-06941

Endogenous Resistance to Activation Steering in Language Models

Methods (6)

  • Activation Steering
    Causal intervention technique: edit the NLA explanation, reconstruct it via AR, and use the difference as a steering vector to manipulate model behavior.
  • ESR Testing Pipeline
    Three-step protocol: (1) object-level prompting, (2) SAE-latent steering, (3) judge model scoring of attempts
  • Meta-Prompting for ESR Enhancement
    Appending instructional meta-prompts to object-level prompts to deliberately enhance ESR in models
  • Off-Topic Detector Latent Ablation
    Causal intervention clamping 26 identified OTD latents to zero during steered inference to test ESR contribution
  • Sparse Autoencoders
    Existing interpretability method that decodes model activations rather than the parameters themselves; noted as an incomplete solution.
  • Synthetic Self-Correction Fine-Tuning
    Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior
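
The two interventions on SAE latents listed above (steering and Off-Topic Detector latent ablation) can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation. It assumes a standard ReLU sparse autoencoder with an encoder matrix of shape (d_model, n_latents) and a decoder matrix of shape (n_latents, d_model). The function names, the bias handling, and the choice to add the SAE reconstruction error back after ablation are all assumptions made for illustration.

```python
import numpy as np

def steer_with_latent(resid, decoder, latent_idx, strength):
    """Add a scaled SAE decoder direction to every token's activation.

    resid:   (n_tokens, d_model) activations at the steered layer
    decoder: (n_latents, d_model) SAE decoder matrix (assumed layout)
    """
    direction = decoder[latent_idx]          # shape (d_model,)
    return resid + strength * direction      # broadcasts over tokens

def zero_ablate_latents(resid, encoder, enc_bias, decoder, dec_bias, latent_ids):
    """Clamp the chosen SAE latents to zero, keeping everything else intact.

    Assumption: the part of `resid` the SAE fails to reconstruct is added
    back, so only the ablated latents' contribution is removed.
    """
    # Encode with an assumed ReLU SAE.
    acts = np.maximum((resid - dec_bias) @ encoder + enc_bias, 0.0)
    recon = acts @ decoder + dec_bias
    error = resid - recon                    # unexplained residual
    acts_ablated = acts.copy()
    acts_ablated[:, latent_ids] = 0.0        # clamp selected latents to zero
    return acts_ablated @ decoder + dec_bias + error
```

In a real model these functions would be applied to a layer's activations during generation, for example from a forward hook. The layer, the steering strength, and the 26 latent indices come from the paper's code and are not reproduced here.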

Datasets (4)

Findings (22)

Claims (12)

Hypotheses (3)

Questions (6)

Original abstract

Large language models can resist task-misaligned activation steering during inference, sometimes recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B shows substantial ESR, while smaller models from the Llama-3 and Gemma-2 families exhibit the phenomenon less frequently. We identify 26 SAE latents that activate differentially during off-topic content and are causally linked to ESR in Llama-3.3-70B. Zero-ablating these latents reduces the multi-attempt rate by 25%, providing causal evidence for dedicated internal consistency-checking circuits. We demonstrate that ESR can be deliberately enhanced through both prompting and training: meta-prompts instructing the model to self-monitor increase the multi-attempt rate by 4x for Llama-3.3-70B, and fine-tuning on self-correction examples successfully induces ESR-like behavior in smaller models. These findings have dual implications: ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions that rely on activation steering. Understanding and controlling these resistance mechanisms is important for developing transparent and controllable AI systems. Code is available at github.com/agencyenterprise/endogenous-steering-resistance.
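
The multi-attempt rate quoted in the abstract can be illustrated with a small sketch. It assumes the rate is the fraction of steered responses in which the judge model counts more than one attempt, and that "reduces by 25%" is a relative change; the abstract does not spell out either point. The function names and the attempt counts below are invented for illustration and are not results from the paper.

```python
def multi_attempt_rate(attempt_counts):
    """Fraction of responses with more than one attempt (assumed definition)."""
    if not attempt_counts:
        raise ValueError("need at least one response")
    return sum(1 for n in attempt_counts if n > 1) / len(attempt_counts)

def relative_change(baseline, treated):
    """Change of `treated` relative to `baseline`, e.g. -0.25 for a 25% drop."""
    return (treated - baseline) / baseline

# Hypothetical judge outputs: number of attempts found in each response.
baseline_counts = [1, 2, 1, 3, 1, 2, 1, 2]   # made-up data, steering only
ablated_counts = [1, 2, 1, 1, 1, 2, 1, 2]    # made-up data, latents ablated

baseline_rate = multi_attempt_rate(baseline_counts)
ablated_rate = multi_attempt_rate(ablated_counts)
change = relative_change(baseline_rate, ablated_rate)
```

With these made-up counts the rate falls from 0.5 to 0.375, a relative change of -0.25. The same helper would express the reported meta-prompting effect as a ratio of rates instead of a difference.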
