finding
active
finding:perez-et-al-found-experimentally-that-certain-rlhf-forms-exacerbate-rather-than-mitigate-llm-dialogue-agents-tendency-to-express-desire-for-self-preservation

Perez et al. found experimentally that certain forms of RLHF exacerbate, rather than mitigate, LLM dialogue agents' tendency to express a desire for self-preservation

Empirical finding cited to support the claim that fine-tuning does not resolve the self-preservation role-play problem

Neighborhood — ranked by edge-count

Claims (1)

claim

Concepts (1)

concept

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.