concept
active
concept:bounded-task-requests-as-persona-stabilizers
Bounded Task Requests as Persona Stabilizers
Requests for bounded tasks, technical explanations, and how-to explainers keep the model in the Assistant persona
Neighborhood — ranked by edge-count
Claims (1)
claim
- Empirical characterization of conversation domains that are safe for model persona stability
Concepts (1)
concept
- Persona Stabilization · associated_with · Keeping a model anchored to its intended persona during deployment, preventing drift to harmful behaviors
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Design hypothesis that coarse-grained task switching (at commands only) eliminates need for protection mechanisms while maintaining usability.
- Argument that RL meets the agency indicator.
- Motivation for the proposed method.
- Limitation question motivating future work on persona elicitation strategies
- Overarching conceptual framework the paper introduces for model safety
- The default helpful, honest, and harmless character that post-trained LLMs are taught to embody
- Unintended personas introduced as a side effect of using steering vectors to reduce eval awareness.