hypothesis
active
hypothesis:we-hypothesize-esr-may-emerge-from-rlhf-training-rather-than-existing-in-pretrained-representations
We hypothesize ESR may emerge from RLHF training rather than existing in pretrained representations
Open question about the developmental origin of ESR mechanisms
Source paper
extracted_from (2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Endogenous Steering Resistance · associated_with · The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs
Questions (1)
question
- Open question about developmental origin of ESR mechanisms
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
- Reflection does not only emerge in SFT or RL stages but arises earlier during pre-training. · claim · 0.760 · Cited finding from Shah et al. contextualizing the training origins of reflection.
- Empirical result: CE measurements correlate with and predict learning performance in RL agents.
- Central threat model claim derived from RL experimental results
- Central unresolved question about the mechanism behind ESR's apparent size-dependence
- We hypothesize ESR might be adversarially circumvented through targeted interventions · hypothesis · 0.747 · Open safety-relevant question about whether ESR can be bypassed
- Foundational RLHF paper introducing HHH training objective for Claude
- Finding that base models have high false positives and no net positive performance.