Endogenous Steering Resistance

The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
introduces

Frameworks (2)

framework

Attention Schema Theory
analogous_to
Theory by Graziano linking consciousness to a predictive model of attention; listed in Butlin et al. 2023.
Representation Engineering
contradicts
A class of methods that modify how models internally process representations; Self-Other Overlap (SOO) fine-tuning fits within this framework

Communities (1)

community

LLM Introspection
associated_with

Methods (2)

method

Synthetic Self-Correction Fine-Tuning
about
Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior
Meta-Prompting for ESR Enhancement
about
Appending instructional meta-prompts to object-level prompts to deliberately enhance ESR in models
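
The two methods above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the helper names `mask_labels` and `add_meta_prompt` are hypothetical, and the choice to mask everything before the self-correction segment is an assumption, since the entry only says that loss masking is used. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_labels(token_ids, supervised_start):
    """Build training labels that ignore every position before supervised_start.

    Assumption: only the tokens from the self-correction onward contribute
    to the loss; the prompt and the off-topic first attempt are masked out.
    """
    if not 0 <= supervised_start <= len(token_ids):
        raise ValueError("supervised_start must lie within the sequence")
    return [IGNORE_INDEX] * supervised_start + list(token_ids[supervised_start:])


def add_meta_prompt(prompt, meta_prompt):
    """Append an instructional meta-prompt to an object-level prompt."""
    return f"{prompt.rstrip()}\n\n{meta_prompt.strip()}"
```

For instance, `mask_labels([5, 6, 7, 8], 2)` leaves only the last two tokens supervised, and `add_meta_prompt` joins the two prompts with a blank line between them.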

Concepts (14)

concept

AI Alignment and Safety
associated_with
The broader domain for which ESR has dual implications: resistance to adversarial manipulation vs. interference with safety interventions
Internal Consistency Monitoring
implements
The inferred mechanism underlying ESR whereby the model tracks coherence of its own outputs
Top-Down Attentional Control
analogous_to
Biological analogue to ESR where top-down mechanisms detect distracting inputs and redirect processing
Explicit ESR
extends
The form of ESR focused on in this paper, measured by verbal self-interruption phrases as segment boundaries
Implicit ESR
extends
Form of ESR occurring without explicit verbal self-interruption markers, not captured by current metrics
Relevance Filtering of SAE Latents
supports
Pre-filtering step excluding latents naturally activated by each prompt to ensure genuine off-topic steering
Hydra Effect
contradicts
Phenomenon where layer ablations trigger silent downstream compensation, contrasted with ESR
LLM Self-Correction
extends
Related capability where LLMs correct their own outputs, studied via linear representations.
Concreteness Filtering of SAE Latents
supports
Pre-filtering step excluding abstract latents where off-topic detection is harder
Conditional Mean Score Improvement (metric)
implements
Score delta between last and first attempt for multi-attempt responses, measuring correction effectiveness
ESR Rate (metric)
implements
Primary metric: percentage of responses containing multiple attempts that successfully improve on the first attempt
Multi-Attempt Rate (metric)
implements
Secondary metric: percentage of responses containing multiple attempts, whether or not the later attempts improve; separates surface-level from effective self-correction
Multi-Attempt Response
implements
A response containing multiple distinct attempts to answer the prompt; the unit on which the ESR metrics are computed
Scale-Dependent ESR
extends
The observed pattern that ESR appears predominantly in the largest model tested, suggesting scale-dependence
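
The three metrics listed above can be computed from per-attempt scores roughly as follows. This is a sketch under stated assumptions: the function name `esr_metrics` and the input layout (one list of attempt scores per response) are hypothetical, "improve" is taken to mean that the last attempt scores strictly higher than the first, and both rates use all responses as the denominator.

```python
from statistics import mean


def esr_metrics(attempt_scores):
    """Compute ESR-style metrics from a list of per-response attempt scores.

    attempt_scores: one list per response, holding the judge score of each
    attempt in order (assumed layout; a single-attempt response has length 1).
    """
    total = len(attempt_scores)
    if total == 0:
        raise ValueError("need at least one response")
    # Responses with more than one attempt.
    multi = [s for s in attempt_scores if len(s) > 1]
    # Multi-attempt responses whose last attempt beats the first (assumption).
    improved = [s for s in multi if s[-1] > s[0]]
    return {
        "multi_attempt_rate": 100.0 * len(multi) / total,
        "esr_rate": 100.0 * len(improved) / total,
        # Mean of (last - first), conditional on the response being multi-attempt.
        "conditional_mean_improvement": (
            mean(s[-1] - s[0] for s in multi) if multi else None
        ),
    }
```

With the scores `[[20], [10, 60], [30, 20], [50]]`, two of four responses are multi-attempt and one of them improves, so the sketch reports a multi-attempt rate of 50%, an ESR rate of 25%, and a conditional mean improvement of 20 points.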

Hypotheses (2)

hypothesis

We hypothesize ESR may emerge from RLHF training rather than existing in pretrained representations
associated_with
Open question about the developmental origin of ESR mechanisms
We hypothesize earlier-layer interventions allow more downstream computation to process and potentially correct the perturbation
associated_with
Post-hoc explanation for why steering at layer 33 rather than layer 50 produced better ESR behavior in Llama-3.3-70B

Artifacts (1)

artifact

github.com/agencyenterprise/endogenoussteering-resistance
about
Code repository released with the paper for reproducibility

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

steering (intervention on internals) · concept · 0.768
General technique of modifying activations to control model behavior.
direction-based steering · concept · 0.766
Paradigm of finding the right direction in activation space (e.g., linear steering).
All-token steering · method · 0.766
Baseline steering method that applies intervention at every token generation step, shown to degrade performance at high strengths
steering vectors · concept · 0.760
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
Bidirectional Steering · concept · 0.755
Ability to steer model behavior in two opposite semantic directions on a trait.
linear steering · method · 0.751
Typical approach that adds a scaled steering vector to representations; the paper argues this is mismatched with actual representation geometry.
Open-Ended Generation Steering · concept · 0.746
Task of steering LLM free-text responses toward psychological constructs; the primary evaluation regime where injections outperform prompting
Representation Steering · concept · 0.745
Parent concept; the practice of controlling neural network outputs by manipulating internal representations.
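
The additive form of steering described in the entries above (a scaled vector added to a representation) can be illustrated in a few lines. This is a toy sketch on plain Python lists, not any library's API: the function name `steer` and the parameter `strength` are hypothetical, and a real setup would apply the same update to a model's hidden state at a chosen layer.

```python
def steer(hidden, direction, strength):
    """Return hidden + strength * direction, element by element.

    hidden: activation vector at one token position (list of floats).
    direction: steering vector of the same length, e.g. an SAE latent's
        decoder direction (assumed source of the vector).
    strength: scalar multiplier controlling how hard the model is pushed.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + strength * d for h, d in zip(hidden, direction)]
```

Applying this update at every generated token corresponds to the all-token baseline listed above; larger values of `strength` push the representation further along `direction`.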