question
active
question:does-esr-emerge-from-rlhf-or-does-it-exist-in-pretrained-representations
Does ESR emerge from RLHF or does it exist in pretrained representations?
Open question about the developmental origin of ESR mechanisms
Source paper
extracted_from · (2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5
Neighborhood — ranked by edge-count
Papers (1)
paper
Hypotheses (1)
hypothesis
- We hypothesize ESR may emerge from RLHF training rather than existing in pretrained representations · gates · Open question about the developmental origin of ESR mechanisms
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Reflection does not only emerge in SFT or RL stages but arises earlier during pre-training. · claim · 0.764 · Cited finding from Shah et al. contextualizing the training origins of reflection.
- Empirical result: CE measurements correlate with and predict learning performance in RL agents.
- A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
- Authors' interpretive assertion that the observed alignment reveals a novel organizing principle of neural representation dynamics.
- Central unresolved question about the mechanism behind ESR's apparent size-dependence
- Motivated by near-identical PCs for base and instruct Gemma
- Distinguishes ESR from prior work on model self-repair
- We cannot isolate whether ESR reflects scale, architecture, or training procedures in Llama-3.3-70B · claim · 0.722 · Epistemic limitation claim acknowledging confounds in the cross-model comparison