question

active

question:does-esr-emerge-from-rlhf-or-does-it-exist-in-pretrained-representations

Does ESR emerge from RLHF or does it exist in pretrained representations?

Open question about the developmental origin of ESR mechanisms

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
associated_with

Hypotheses (1)

hypothesis

We hypothesize ESR may emerge from RLHF training rather than existing in pretrained representations
gates
Open question about the developmental origin of ESR mechanisms

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Reflection does not only emerge in SFT or RL stages but arises earlier during pre-training. · claim · 0.764
Cited finding from Shah et al. contextualizing the training origins of reflection.
Causal emergence predictive of final reward early in RL training across multiple algorithms, architectures, and environments. · finding · 0.733
Empirical result: CE measurements correlate with and predict learning performance in RL agents.
Reinforcement Learning from Human Feedback (RLHF) · framework · 0.732
A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
Causal emergence may be a previously undisclosed axis of reorganization of neural representations in RL agents. · claim · 0.728
Authors' interpretive assertion that the observed alignment reveals a novel organizing principle of neural representation dynamics.
Does ESR reflect model scale, architecture, or training procedures? · question · 0.728
Central unresolved question about the mechanism behind ESR's apparent size-dependence
We hypothesize that axes of persona differentiation within LLMs are likely already present in base models and inherited from the pre-training corpus · hypothesis · 0.726
Motivated by near-identical PCs for base and instruct Gemma
ESR differs from the Hydra Effect in that ESR involves active, online detection and correction with explicit self-interruption tokens · claim · 0.726
Distinguishes ESR from prior work on model self-repair
We cannot isolate whether ESR reflects scale, architecture, or training procedures in Llama-3.3-70B · claim · 0.722
Epistemic limitation claim acknowledging confounds in the cross-model comparison