concept
active
concept:endogenous-steering-resistance

Endogenous Steering Resistance

The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs

Neighborhood — ranked by edge-count

Frameworks (2)

framework
  • Theory by Graziano linking consciousness to a predictive model of attention; listed in Butlin et al. 2023.
  • A class of methods that modify how models internally process representations; SOO fine-tuning fits within this framework

Communities (1)

community

Methods (2)

method

Concepts (14)

concept
  • The broader domain for which ESR has dual implications: resistance to adversarial manipulation vs. interference with safety interventions
  • The inferred mechanism underlying ESR whereby the model tracks coherence of its own outputs
  • Biological analogue to ESR where top-down mechanisms detect distracting inputs and redirect processing
  • The form of ESR focused on in this paper, measured by verbal self-interruption phrases as segment boundaries
  • Form of ESR occurring without explicit verbal self-interruption markers, not captured by current metrics
  • Pre-filtering step excluding latents naturally activated by each prompt to ensure genuine off-topic steering
  • Hydra Effect
    contradicts
    Phenomenon where layer ablations trigger silent downstream compensation, contrasted with ESR
  • Related capability where LLMs correct their own outputs, studied via linear representations.
  • Pre-filtering step excluding abstract latents where off-topic detection is harder
  • Score delta between last and first attempt for multi-attempt responses, measuring correction effectiveness
  • Primary metric: percentage of responses containing multiple attempts that successfully improve on the first attempt
  • Secondary metric: percentage of responses containing multiple attempts, separating surface from actual self-correction
  • A response containing multiple distinct attempts to answer the prompt; the unit counted by the primary ESR metric
  • The observed pattern that ESR appears predominantly in the largest model tested, suggesting scale-dependence
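The three metric entries above (multi-attempt percentage, improvement percentage, and first-to-last score delta) can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes each response has already been split into attempts and that each attempt has a numeric score. The function name `esr_metrics`, the dictionary keys, and the choice to use all responses as the denominator for both percentages are assumptions made for this example.

```python
from statistics import mean


def esr_metrics(attempt_scores):
    """Summarize ESR-style metrics from per-attempt scores.

    attempt_scores: list of lists; each inner list holds the scores of the
    successive attempts found in one response (hypothetical input format).
    """
    if not attempt_scores:
        raise ValueError("need at least one response")

    total = len(attempt_scores)
    # Responses with more than one attempt count as multi-attempt responses.
    multi = [scores for scores in attempt_scores if len(scores) > 1]
    # Score delta: last attempt minus first attempt.
    deltas = [scores[-1] - scores[0] for scores in multi]
    improved = sum(1 for d in deltas if d > 0)

    return {
        # Assumption: both percentages are taken over all responses.
        "multi_attempt_pct": 100.0 * len(multi) / total,
        "improved_pct": 100.0 * improved / total,
        "mean_score_delta": mean(deltas) if deltas else 0.0,
    }


if __name__ == "__main__":
    example = [[10], [20, 60], [50, 30], [0, 40, 80]]
    print(esr_metrics(example))
```

Keeping the multi-attempt percentage separate from the improvement percentage mirrors the distinction drawn above between a response that merely restarts and one whose later attempt actually scores higher.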

Hypotheses (2)

hypothesis

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • General technique of modifying activations to control model behavior.
  • Paradigm of finding the right direction in activation space (e.g., linear steering).
  • Baseline steering method that applies intervention at every token generation step, shown to degrade performance at high strengths
  • steering vectors · concept · 0.760
    A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
  • Ability to steer model behavior in two opposite semantic directions on a trait.
  • linear steering · method · 0.751
    Typical approach that adds a scaled steering vector to representations; the paper argues this is mismatched with actual representation geometry.
  • Task of steering LLM free-text responses toward psychological constructs; the primary evaluation regime where injections outperform prompting
  • Parent concept; the practice of controlling neural network outputs by manipulating internal representations.
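Several of the entries above describe steering as adding a scaled vector to a model's representations, with a baseline variant that intervenes at every token generation step. The sketch below shows only that arithmetic on plain Python lists. It is an illustration under stated assumptions, not any library's API: `steer`, `steer_every_step`, and the `strength` parameter are hypothetical names, and a real setup would apply the same update to a chosen layer's activations inside the model.

```python
def steer(hidden, direction, strength):
    """Return hidden + strength * direction for one activation vector."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + strength * d for h, d in zip(hidden, direction)]


def steer_every_step(hidden_states, direction, strength):
    """Apply the same steering update to the activation at every step.

    hidden_states: list of activation vectors, one per generated token
    (hypothetical stand-in for a layer's per-token activations).
    """
    return [steer(h, direction, strength) for h in hidden_states]


if __name__ == "__main__":
    states = [[1.0, 2.0], [0.0, 0.0]]
    print(steer_every_step(states, [0.5, -1.0], 2.0))
```

Larger values of `strength` push the activations further along `direction`, which is the knob the baseline entry above associates with degraded performance at high strengths.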