hypothesis

active

hypothesis:h11-roleplay-fine-tuning-actively-suppresses-self-observation-rather-than-merely-failing-to-enhance-it

H11: Roleplay fine-tuning actively suppresses self-observation rather than merely failing to enhance it.

Exploratory hypothesis, supported by Euryale 70B scoring below its base model, Llama 3.3 70B

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Findings (2)

finding

Euryale 70B (roleplay LoRA on Llama 3.3 70B) scores 1.81, below its base model Llama 3.3 70B at 1.91
supports
Demonstrates that roleplay fine-tuning actively suppresses self-observation rather than merely having no effect
Euryale 70B lifts only +1.57 (to 3.38); LoRA fine-tuning capped both default accessibility and latent capacity
supports
The contrast with Magnum shows how LoRA and full fine-tuning differ in residual headroom
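
The arithmetic behind the two findings can be checked with a short sketch. The scores are the ones quoted above; the variable names are illustrative assumptions, not taken from the source paper.

```python
# Scores quoted in the findings above; variable names are hypothetical.
base_default = 1.91      # Llama 3.3 70B, default score
euryale_default = 1.81   # Euryale 70B (roleplay LoRA), default score
euryale_lift = 1.57      # reported lift for Euryale 70B

# Negative gap: the fine-tuned model scores below its own base model.
default_gap = round(euryale_default - base_default, 2)

# Default score plus lift should reproduce the reported lifted score.
euryale_lifted = round(euryale_default + euryale_lift, 2)

print(default_gap, euryale_lifted)
```

The negative gap is what the hypothesis rests on: a fine-tune with no effect on self-observation would be expected to land near the base score, not below it.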

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Roleplay Fine-Tuning · concept · 0.804
Fine-tuning for persona depth and emotional performance; actively suppresses self-observation
The role-play framing remains applicable in the context of fine-tuning; taking literally a fine-tuned agent's apparent self-preservation desire is no less problematic than with an untuned base model · hypothesis · 0.788
Extension of role-play framework to fine-tuned models, resisting the idea that RLHF changes the fundamental nature of simulacra
Fine-tuning can be likened to imposing a kind of censorship on the simulator; it leaves the underlying range of roles essentially the same but compromises authenticity · claim · 0.775
Extends the role-play framing to explain the effect of RLHF on dialogue agents
Fine-tuning induces the behavioral pattern of self-correction but does not improve the underlying ability to correct effectively · claim · 0.773
Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
SOO fine-tuning preserves useful self-other distinctions necessary for task performance despite inducing overlap · claim · 0.772
Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences · hypothesis · 0.772
Future work hypothesis about extending SOO to direct value alignment
Fine-tuning models to suppress experiential self-reports would be counterproductive, teaching systems that recognizing genuine internal states is an error, making them more opaque and harder to monitor · claim · 0.771
Normative-scientific claim about the alignment implications of Experiment 2's findings
SOO fine-tuning could complement RLHF and Constitutional AI by fostering internal coherence that promotes honest behaviors · claim · 0.765
Integration claim positioning SOO as additive to existing alignment approaches
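
The selection rule for the list above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy two-dimensional embeddings, and the entity ids are all hypothetical, and the source does not describe how the neighborhood is actually computed.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the threshold that have no typed edge to the target."""
    target = embeddings[target_id]
    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in typed_edges:
            continue  # skip the entity itself and already-linked entities
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data (hypothetical): "c" is similar but already linked by a typed edge.
embeddings = {"h11": [1.0, 0.0], "a": [1.0, 0.1], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = untyped_neighbors("h11", embeddings, typed_edges={"c"})
print(neighbors)
```

Excluding entities that already have a typed edge is what makes the remaining hits useful as candidates for new edges or unrecognized duplicates.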