claim

active

claim:we-cannot-isolate-whether-esr-reflects-scale-architecture-or-training-procedures-in-llama-3-3-70b

We cannot isolate whether ESR reflects scale, architecture, or training procedures in Llama-3.3-70B

Epistemic limitation claim acknowledging confounds in the cross-model comparison

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
introduces

Findings (2)

finding

All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models
supports
Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
All three Gemma-2 models show ESR rates below 1%, nearly indistinguishable from zero
supports
Establishes potential Llama-family specificity or scale specificity of ESR phenomenon

Questions (1)

question

Does ESR reflect model scale, architecture, or training procedures?
gates
Central unresolved question about the mechanism behind ESR's apparent size-dependence

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
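The filter described above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is only an illustration under stated assumptions: the page does not say how entities are embedded or stored, so the `embeddings` dictionary, the `typed_edges` set, and the function names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score.

    embeddings  -- hypothetical dict: entity id -> embedding vector
    typed_edges -- hypothetical set of frozenset({id_a, id_b}) pairs
    """
    target = embeddings[target_id]
    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine_similarity(target, vector)
        if score >= threshold:
            results.append((entity_id, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

Entities returned by a filter like this are the ones a curator would review as candidates for new edges or as possible duplicates.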

ESR exhibits non-monotonic relationship with boost level, peaking around -0.3σ below threshold in Llama-3.3-70B · finding · 0.775
Characterizes the narrow operating window in which ESR can manifest
Llama-3.3-70B corrected response scores 75/100 rather than 100 due to residual steering effects (Snell's law reference) · finding · 0.767
Illustrative finding that ESR mitigates but does not fully eliminate steering influence
Meta-prompt ESR enhancement effects scale with model size across Llama and Gemma families · finding · 0.766
Suggests underlying self-monitoring circuits must be present for meta-prompting to enhance them
Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.764
Central interpretive claim of the paper supported by causal ablation and activation evidence
Llama-3.3-70B-Instruct · concept · 0.764
Primary model of interest showing substantial ESR; largest model tested in the study
For LLaMA-2-70B, probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans regardless of probing technique · finding · 0.761
Striking cross-domain generalization result supporting the claim that larger models represent abstract truth
The 26 differentially-activated OTD latents play a causally important role in enabling ESR in Llama-3.3-70B · claim · 0.761
Causal interpretation of the ablation experiment results
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with an even split between human and nonhuman portrayals · finding · 0.753
Model-specific difference in persona susceptibility