claim
active
claim:persona-drift-away-from-the-assistant-opens-up-the-possibility-of-the-model-assuming-harmful-character-traits-increasing-the-rate-of-harmful-responses

Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses

Causal interpretation linking Assistant Axis deviation to harmful behavior

Source paper

extracted_from
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Related by similarity (8)

cosine similarity ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one. They are candidates for new edges, or may be unrecognized duplicates.