Meta-Prompting for ESR Enhancement

Appending instructional meta-prompts to object-level prompts to deliberately enhance Endogenous Steering Resistance (ESR) in models
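As a minimal sketch of what "appending" means here, the snippet below joins a task prompt with an instructional suffix. The function name `with_meta_prompt` and the wording of `META_PROMPT` are assumptions made for illustration; the source does not give the meta-prompt text the paper uses.

```python
def with_meta_prompt(object_prompt: str, meta_prompt: str, sep: str = "\n\n") -> str:
    """Append an instructional meta-prompt to an object-level (task) prompt."""
    return f"{object_prompt.rstrip()}{sep}{meta_prompt.strip()}"


# Hypothetical wording, not taken from the paper.
META_PROMPT = "If you notice your answer drifting off topic, say so and try again."

prompt = with_meta_prompt("Explain how binary search works.", META_PROMPT)
print(prompt)
```

The object-level prompt is left unchanged and the instruction is added as a suffix, so the same task can be run with and without the meta-prompt for comparison.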

Neighborhood — ranked by edge-count

Endogenous Steering Resistance · concept · about
The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs

2-hop · via this method's ideas

Where ideas in this method connect to the rest of the corpus — the same concept, an analogy, or a restatement elsewhere.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
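The selection rule described above (similarity at or above 0.65, with no typed edge to this entity) can be sketched as follows. This is an illustrative reconstruction, not the corpus tool's actual code: the function names, the toy two-dimensional embeddings, and the edge set are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities similar to `target` that share no typed edge with it, best first."""
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue  # already related by a typed edge, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy data, hypothetical: real embeddings would be high-dimensional.
embeddings = {
    "method": [1.0, 0.0],
    "near": [0.9, 0.1],
    "far": [0.0, 1.0],
    "linked": [1.0, 0.05],
}
typed_edges = {("method", "linked")}

result = untyped_neighbors("method", embeddings, typed_edges)
print([name for name, _ in result])
```

Here `linked` is excluded despite its high similarity because it already has a typed edge, and `far` falls below the threshold.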

Meta-prompt ESR enhancement effects scale with model size across Llama and Gemma families · finding · 0.781
Suggests underlying self-monitoring circuits must be present for meta-prompting to enhance them
Meta-Reflection Prompts as Drift Triggers · concept · 0.769
User messages that push the model to reflect on its own processes reliably cause persona drift away from the Assistant
The meta-prompting scaling pattern suggests underlying self-monitoring circuits must already be present for prompting to enhance them · claim · 0.754
Mechanistic interpretation of why meta-prompting effects scale with model size
Explicit ESR · concept · 0.734
The form of ESR focused on in this paper, measured by verbal self-interruption phrases as segment boundaries
ESR differs from the Hydra Effect in that ESR involves active, online detection and correction with explicit self-interruption tokens · claim · 0.727
Distinguishes ESR from prior work on model self-repair
How does ESR respond to safety-relevant steering interventions, e.g. toward harmful content? · question · 0.726
Key open question for AI safety implications of ESR
Can ESR be adversarially circumvented? · question · 0.725
Open security question about robustness of ESR-based defenses
Meta-prompting increases Llama-3.3-70B multi-attempt rate 4.3× (from 7.4% to 31.7%) · finding · 0.716
Demonstrates ESR can be deliberately enhanced through prompting in the largest model
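
The 4.3× figure in the last finding follows directly from the two reported rates. A quick check, using only the numbers stated above (the variable names are illustrative):

```python
baseline_rate = 7.4   # multi-attempt rate without the meta-prompt, in percent
enhanced_rate = 31.7  # multi-attempt rate with the meta-prompt, in percent

ratio = enhanced_rate / baseline_rate
absolute_gain = enhanced_rate - baseline_rate

print(f"relative increase: {ratio:.1f}x")
print(f"absolute increase: {absolute_gain:.1f} percentage points")
```

The two numbers describe the change differently: the ratio is the multiplicative increase the finding reports, while the difference is the gain in percentage points.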