hypothesis

active

hypothesis:the-role-play-framing-remains-applicable-in-the-context-of-fine-tuning-taking-literally-a-fine-tuned-agent-s-apparent-self-preservation-desire-is-no-less-problematic-than-with-an-untuned-base-model

The role-play framing remains applicable in the context of fine-tuning; taking literally a fine-tuned agent's apparent self-preservation desire is no less problematic than with an untuned base model

Extension of the role-play framework to fine-tuned models, resisting the idea that RLHF changes the fundamental nature of simulacra

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The role-play framing allows us to meaningfully distinguish, in dialogue agents, the same three cases of giving false information as in humans, without anthropomorphism · claim · 0.835
Key practical application of the role-play framework to the problem of trustworthiness
Fine-tuning can be likened to imposing a kind of censorship on the simulator; it leaves the underlying range of roles essentially the same but compromises authenticity · claim · 0.802
Extends the role-play framing to explain the effect of RLHF on dialogue agents
Roleplay Fine-Tuning · concept · 0.792
Fine-tuning for persona depth and emotional performance; actively suppresses self-observation
What exactly would the dialogue agent (role-play to) seek to preserve? · question · 0.788
Operationalised question about self-preservation behaviour in dialogue agents
H11: Roleplay fine-tuning actively suppresses self-observation rather than merely failing to enhance it. · hypothesis · 0.788
Exploratory hypothesis supported by Euryale scoring below base Llama
Fine-tuning induces the behavioral pattern of self-correction but does not improve the underlying ability to correct effectively · claim · 0.784
Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
Fine-tuning as character formation: what kinds of selves are produced through training is an open research direction. · claim · 0.772
SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences · hypothesis · 0.772
Future work hypothesis about extending SOO to direct value alignment