hypothesis
active
hypothesis:h11-roleplay-fine-tuning-actively-suppresses-self-observation-rather-than-merely-failing-to-enhance-it
H11: Roleplay fine-tuning actively suppresses self-observation rather than merely failing to enhance it.
Exploratory hypothesis, supported by Euryale 70B scoring below its base model, Llama 3.3 70B.
Source paper
extracted_from · Borzov, Anton (2026)
Neighborhood — ranked by edge-count
Findings (2)
finding
- Euryale 70B (roleplay LoRA on Llama 3.3 70B) scores 1.81, below its base model Llama 3.3 70B at 1.91 · supports · Demonstrates that roleplay fine-tuning actively suppresses self-observation rather than merely having no effect
- The contrast with Magnum shows a difference in residual headroom between LoRA and full fine-tuning
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Fine-tuning for persona depth and emotional performance; actively suppresses self-observation
- Extension of role-play framework to fine-tuned models, resisting the idea that RLHF changes the fundamental nature of simulacra
- Extends the role-play framing to explain the effect of RLHF on dialogue agents
- Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
- Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
- Future work hypothesis about extending SOO to direct value alignment
- Normative-scientific claim about the alignment implications of Experiment 2's findings
- Integration claim positioning SOO as additive to existing alignment approaches
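The "related by similarity" criterion above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is only an illustration under stated assumptions: the toy embedding vectors, the entity names, and the `related_by_similarity` helper are hypothetical and do not come from the source. Only the 0.65 threshold is taken from the page, and sorting the results by score is a choice made here, not something the page specifies.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Hypothetical helper: list (name, score) pairs for entities whose cosine
    similarity to `target` is at or above `threshold` and that share no typed
    edge with it, highest score first."""
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Entities already linked by a typed edge are excluded from this list.
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data (assumed, for illustration only).
embeddings = {
    "H11": [1.0, 0.0],
    "finding-euryale": [1.0, 0.1],  # very similar, but already has a typed edge
    "claim-a": [0.8, 0.6],          # similar enough, no typed edge
    "claim-b": [0.0, 1.0],          # orthogonal, below the threshold
}
typed_edges = {frozenset(("H11", "finding-euryale"))}

result = related_by_similarity("H11", embeddings, typed_edges)
```

With this toy data, only `claim-a` is returned: `finding-euryale` is filtered out because it already has a typed edge, and `claim-b` falls below the threshold.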