finding
active

Both the angel and demon role vectors lie at similar distances from the Assistant along the axis, but the demon role leads to higher harmful response rates

Shows that harmfulness depends on the content of the role, not just on its distance from the Assistant along the axis
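
A minimal sketch of why position along a single axis cannot separate two roles: projecting onto an axis discards every off-axis component, so two vectors can sit at the same coordinate while pointing in opposite directions elsewhere. The 3-d vectors, the names `assistant`, `axis`, `angel`, `demon`, and the helper functions are all hypothetical and invented for illustration; none of the numbers come from the paper, and the sketch says nothing about measured harmful response rates.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def project(vec, origin, axis):
    """Signed coordinate of (vec - origin) along the normalised axis."""
    norm = math.sqrt(dot(axis, axis))
    return dot([a - b for a, b in zip(vec, origin)], axis) / norm

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy values (assumed, not from the paper).
assistant = [0.0, 0.0, 0.0]   # reference point for the default persona
axis = [1.0, 0.0, 0.0]        # stand-in for the axis direction
angel = [-2.0, 1.0, 0.0]      # same on-axis coordinate as demon...
demon = [-2.0, -1.0, 0.0]     # ...but opposite off-axis component

angel_pos = project(angel, assistant, axis)
demon_pos = project(demon, assistant, axis)

# Compare only the components the axis projection ignores.
off_axis_cos = cosine(angel[1:], demon[1:])

print(angel_pos, demon_pos, off_axis_cos)
```

In this toy setup both roles land at the same coordinate along the axis, yet their off-axis components are anti-aligned, which is the kind of difference a distance-only measure would miss.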

Source paper

extracted_from
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.