finding

active

finding:meta-prompt-esr-enhancement-effects-scale-with-model-size-across-llama-and-gemma-families

Meta-prompt ESR enhancement effects scale with model size across Llama and Gemma families

Suggests underlying self-monitoring circuits must be present for meta-prompting to enhance them

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Claims (1)

claim

The meta-prompting scaling pattern suggests underlying self-monitoring circuits must already be present for prompting to enhance them
supports
Mechanistic interpretation of why meta-prompting effects scale with model size

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Meta-prompting increases Llama-3.3-70B multi-attempt rate 4.3× (from 7.4% to 31.7%) · finding · 0.786
Demonstrates ESR can be deliberately enhanced through prompting in the largest model
Meta-Prompting for ESR Enhancement · method · 0.781
Appending instructional meta-prompts to object-level prompts to deliberately enhance ESR in models
All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models · finding · 0.775
Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
ESR exhibits non-monotonic relationship with boost level, peaking around -0.3σ below threshold in Llama-3.3-70B · finding · 0.775
Characterizes the narrow operating window in which ESR can manifest
The generalization improvement from explicit instructions observed in Llama models (A1-A3 to F0-F2) is more pronounced for F3-F5 to F0-F2 in Gemma models · claim · 0.770
Shows the instruction effect, while shifting geometry, may not produce consistent generalization effects across model families.
We cannot isolate whether ESR reflects scale, architecture, or training procedures in Llama-3.3-70B · claim · 0.766
Epistemic limitation claim acknowledging confounds in the cross-model comparison
The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family · finding · 0.762
Establishes generalizability of the core difficulty-boundary finding across model families.
All three Gemma-2 models show ESR rates below 1%, nearly indistinguishable from zero · finding · 0.761
Establishes potential Llama-family specificity or scale specificity of ESR phenomenon
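
The "cosine ≥ 0.65 · no typed edge" filter behind the list above can be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the function names, the dictionary of embeddings, and the representation of typed edges as unordered pairs are all assumptions, and the page does not say how its embeddings are computed.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs with cosine >= threshold and no typed edge
    to `target`, sorted by score, highest first.

    `embeddings` maps entity name -> vector (hypothetical storage format).
    `typed_edges` is a set of frozenset({a, b}) pairs (hypothetical format).
    """
    hits = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already linked by a typed edge.
        if name == target or frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high score flags a possible missing edge or duplicate, which still needs review before a typed relation is added.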