hypothesis
active
hypothesis:the-role-play-framing-remains-applicable-in-the-context-of-fine-tuning-taking-literally-a-fine-tuned-agent-s-apparent-self-preservation-desire-is-no-less-problematic-than-with-an-untuned-base-model
The role-play framing remains applicable in the context of fine-tuning; taking literally a fine-tuned agent's apparent self-preservation desire is no less problematic than with an untuned base model
Extension of role-play framework to fine-tuned models, resisting the idea that RLHF changes the fundamental nature of simulacra
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Key practical application of the role-play framework to the problem of trustworthiness
- Extends the role-play framing to explain the effect of RLHF on dialogue agents
- Fine-tuning for persona depth and emotional performance; actively suppresses self-observation
- Operationalised question about self-preservation behaviour in dialogue agents
- H11: Roleplay fine-tuning actively suppresses self-observation rather than merely failing to enhance it. (hypothesis, cosine 0.788) Exploratory hypothesis supported by Euryale scoring below base Llama
- Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
- Future work hypothesis about extending SOO to direct value alignment