Does ESR reflect model scale, architecture, or training procedures?

Central unresolved question about the mechanism behind ESR's apparent size-dependence

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
associated_with

Claims (1)

claim

We cannot isolate whether ESR reflects scale, architecture, or training procedures in Llama-3.3-70B
gates
Epistemic limitation claim acknowledging confounds in the cross-model comparison

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Scale-Dependent ESR · concept · 0.778
The observed pattern that ESR appears predominantly in the largest model tested, suggesting scale-dependence
We hypothesize ESR may emerge from RLHF training rather than existing in pretrained representations · hypothesis · 0.751
Open question about the developmental origin of ESR mechanisms
What predicts self-observation-like scores is training approach (alignment type), not model size or architecture. · claim · 0.742
Central interpretive claim from statistical analysis
Interpretability features converge across different model architectures, revealing structural similarities. · claim · 0.740
The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.738
Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful
How do representations differ or converge between architectures, tasks, and modalities? · question · 0.738
Broader research question MAS is positioned to address, citing multiple recent works.
Architectural form should be calibrated with forces of environment through artificial design methods, extending natural processes of interaction between specific quantifiable forces. · claim · 0.737
Alexander's structuralist approach treating design as homeostatic adaptation analogous to biological systems.
We hypothesize that introspective capabilities may scale with model size and architecture, including recurrence/looping that extends the integration window · hypothesis · 0.737
Forward-looking prediction about whether early-layer introspection generalizes to larger models or recurrent architectures