concept
active
concept:endogenous-steering-resistance
Endogenous Steering Resistance
The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs
Neighborhood — ranked by edge-count
Papers (1)
Frameworks (2)
- Attention Schema Theory · analogous_to · Theory by Graziano linking consciousness to a predictive model of attention; listed in Butlin et al. 2023.
- Representation Engineering · contradicts · A class of methods that modify how models internally process representations; SOO fine-tuning fits within this framework.
Communities (1)
- LLM Introspection · associated_with
Methods (2)
- Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior
- Appending instructional meta-prompts to object-level prompts to deliberately enhance ESR in models
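The two methods above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: it assumes the loss mask covers the prompt tokens so that only the self-correction response is trained on, and it uses `-100` because that is the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The function names and the example meta-prompt wording are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def build_masked_labels(prompt_ids, response_ids):
    """Concatenate prompt and response tokens, masking the prompt in the labels.

    Assumption: loss is computed on response tokens only. The source says
    only that loss masking is used, not which span is masked.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def append_meta_prompt(object_prompt, meta_prompt):
    """Append an instructional meta-prompt after the object-level prompt."""
    return f"{object_prompt.rstrip()}\n\n{meta_prompt.strip()}"


# Hypothetical wording, chosen for illustration only.
EXAMPLE_META_PROMPT = "If your answer drifts off topic, stop and start again."
```

For example, `build_masked_labels([1, 2, 3], [4, 5])` yields labels in which the first three positions are ignored by the loss and the last two carry the response token ids.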
Concepts (14)
- AI Alignment and Safety · associated_with · The broader domain for which ESR has dual implications: resistance to adversarial manipulation vs. interference with safety interventions
- Internal Consistency Monitoring · implements · The inferred mechanism underlying ESR whereby the model tracks coherence of its own outputs
- Top-Down Attentional Control · analogous_to · Biological analogue to ESR where top-down mechanisms detect distracting inputs and redirect processing
- Explicit ESR · extends · The form of ESR focused on in this paper, measured by verbal self-interruption phrases as segment boundaries
- Implicit ESR · extends · Form of ESR occurring without explicit verbal self-interruption markers, not captured by current metrics
- Pre-filtering step excluding latents naturally activated by each prompt to ensure genuine off-topic steering
- Hydra Effect · contradicts · Phenomenon where layer ablations trigger silent downstream compensation, contrasted with ESR
- LLM Self-Correction · extends · Related capability where LLMs correct their own outputs, studied via linear representations.
- Pre-filtering step excluding abstract latents where off-topic detection is harder
- Score delta between last and first attempt for multi-attempt responses, measuring correction effectiveness
- ESR Rate (metric) · implements · Primary metric: percentage of responses containing multiple attempts that successfully improve on the first attempt
- Multi-Attempt Rate (metric) · implements · Secondary metric: percentage of responses containing multiple attempts, separating surface from actual self-correction
- Multi-Attempt Response · implements · A response containing multiple distinct attempts to answer the prompt, used as primary metric for ESR
- Scale-Dependent ESR · extends · The observed pattern that ESR appears predominantly in the largest model tested, suggesting scale-dependence
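The three metrics listed above (score delta, ESR Rate, Multi-Attempt Rate) can be computed from per-attempt scores roughly as follows. This is a minimal sketch, not the released code. It assumes each response has already been segmented into attempts and scored, that "improves on the first attempt" means the last attempt scores strictly higher than the first, and that both rates use all responses as the denominator. The function name `esr_metrics` is hypothetical.

```python
def esr_metrics(responses):
    """Compute ESR-style metrics from a list of per-response score lists.

    Each inner list holds one score per attempt, in order of appearance.
    Assumptions (not confirmed by the source): improvement means
    last score > first score, and rates are taken over all responses.
    """
    if not responses:
        raise ValueError("need at least one response")

    # Responses with more than one attempt.
    multi = [scores for scores in responses if len(scores) > 1]
    # Score delta between last and first attempt for each such response.
    deltas = [scores[-1] - scores[0] for scores in multi]
    improved = sum(1 for d in deltas if d > 0)

    n = len(responses)
    return {
        "multi_attempt_rate": 100.0 * len(multi) / n,
        "esr_rate": 100.0 * improved / n,
        "mean_score_delta": sum(deltas) / len(deltas) if deltas else None,
    }
```

Keeping the two rates separate reflects the distinction drawn in the list: a response can contain several attempts without the later attempts being any better than the first.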
Hypotheses (2)
- We hypothesize ESR may emerge from RLHF training rather than existing in pretrained representations · associated_with · Open question about the developmental origin of ESR mechanisms
- We hypothesize earlier-layer interventions allow more downstream computation to process and potentially correct the perturbation · associated_with · Post-hoc explanation for why steering at layer 33 rather than layer 50 produced better ESR behavior in Llama-3.3-70B
Artifacts (1)
- Code repository released with the paper for reproducibility
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- General technique of modifying activations to control model behavior.
- Paradigm of finding the right direction in activation space (e.g., linear steering).
- Baseline steering method that applies intervention at every token generation step, shown to degrade performance at high strengths
- A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
- Ability to steer model behavior in two opposite semantic directions on a trait.
- Typical approach that adds a scaled steering vector to representations; the paper argues this is mismatched with actual representation geometry.
- Task of steering LLM free-text responses toward psychological constructs; the primary evaluation regime where injections outperform prompting
- Parent concept; the practice of controlling neural network outputs by manipulating internal representations.
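Several of the neighboring entries describe additive activation steering, where a scaled vector is added to a hidden representation. The sketch below shows that operation on plain Python lists. It is a toy illustration under stated assumptions: the function name `steer` is hypothetical, and real implementations operate on tensors inside a model's forward pass, usually at a chosen layer.

```python
def steer(hidden, direction, strength):
    """Return hidden + strength * direction, element by element.

    `hidden` stands in for one activation vector and `direction` for a
    steering vector of the same length (both hypothetical toy inputs).
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + strength * d for h, d in zip(hidden, direction)]
```

Applying this at every generated token corresponds to the baseline described above; ESR concerns what the model does downstream after such a perturbation pushes it off topic.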