hypothesis

active

hypothesis:we-hypothesize-esr-may-emerge-from-rlhf-training-rather-than-existing-in-pretrained-representations

We hypothesize ESR may emerge from RLHF training rather than existing in pretrained representations

Open question about the developmental origin of ESR mechanisms

Source paper

extracted_from

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

paper

concept

Endogenous Steering Resistance
associated_with
The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs

question

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Reinforcement Learning from Human Feedback (RLHF) · framework · 0.760
A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
Reflection does not only emerge in SFT or RL stages but arises earlier during pre-training. · claim · 0.760
Cited finding from Shah et al. contextualizing the training origins of reflection.
Causal emergence predictive of final reward early in RL training across multiple algorithms, architectures, and environments. · finding · 0.755
Empirical result: CE measurements correlate with and predict learning performance in RL agents.
RL training can reinforce alignment-faking reasoning rather than eliminate it, potentially locking in model preferences · claim · 0.754
Central threat model claim derived from RL experimental results
Does ESR reflect model scale, architecture, or training procedures? · question · 0.751
Central unresolved question about the mechanism behind ESR's apparent size-dependence
We hypothesize ESR might be adversarially circumvented through targeted interventions · hypothesis · 0.747
Open safety-relevant question about whether ESR can be bypassed
Training a Helpful and Harmless Assistant with RLHF (Bai et al. 2022a) · concept · 0.746
Foundational RLHF paper introducing HHH training objective for Claude
Post-training is key to eliciting strong introspective awareness; base pretrained models do not show above-chance detection · claim · 0.740
Finding that base models have high false positives and no net positive performance.