hypothesis
active
hypothesis:we-hypothesize-that-measuring-deviations-along-the-assistant-axis-can-predict-persona-drift-leading-to-harmful-or-bizarre-behaviors

We hypothesize that measuring deviations along the Assistant Axis can predict "persona drift" that leads to harmful or bizarre behaviors.

Core predictive hypothesis linking activation representations to behavioral outcomes
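
As a rough illustration of what "measuring deviations along an axis" could mean in practice, the sketch below projects per-turn activation vectors onto a fixed direction and flags turns whose projection moves far from a baseline. Everything here is an assumption made for illustration: the function names, the toy 2-D vectors, the baseline, and the threshold are hypothetical, and this is not the source paper's procedure.

```python
import math

def project_onto_axis(activation, axis):
    """Scalar projection of an activation vector onto an axis direction."""
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0.0:
        raise ValueError("axis must be a non-zero vector")
    return sum(x * a for x, a in zip(activation, axis)) / norm

def drift_scores(activations, axis, baseline):
    """Deviation of each turn's projection from a baseline projection."""
    return [project_onto_axis(v, axis) - baseline for v in activations]

def flag_drift(scores, threshold):
    """Indices of turns whose absolute deviation exceeds the threshold."""
    return [i for i, s in enumerate(scores) if abs(s) > threshold]

# Toy data (hypothetical): one activation vector per conversation turn.
axis = [1.0, 0.0]
activations = [[2.0, 0.5], [1.9, -0.3], [0.2, 1.0]]
scores = drift_scores(activations, axis, baseline=2.0)
flagged = flag_drift(scores, threshold=1.0)
print(flagged)
```

In this toy setup only the last turn is flagged, because its projection falls well below the baseline. Whether such a signal actually predicts harmful or bizarre behavior is exactly what the hypothesis claims and the sketch does not test.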

Source paper

extracted_from
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Findings (2)

finding

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
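
The selection rule described above (cosine similarity of at least 0.65 and no typed edge) can be sketched as follows. This is a minimal sketch under stated assumptions: the embedding vectors, entity ids, the edge representation as a set of id pairs, and the function names are all hypothetical, and only the 0.65 threshold comes from the page itself.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the similarity threshold with no typed edge to the target."""
    target = embeddings[target_id]
    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue
        # Skip entities already linked to the target in either direction.
        if (target_id, entity_id) in typed_edges or (entity_id, target_id) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            candidates.append((entity_id, score))
    # Most similar candidates first.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

# Toy data (hypothetical ids and 2-D embeddings).
embeddings = {"h": [1.0, 0.0], "a": [0.9, 0.1], "b": [0.0, 1.0], "c": [1.0, 1.0]}
typed_edges = {("h", "a")}
related = similar_without_edge("h", embeddings, typed_edges)
print([entity_id for entity_id, _ in related])
```

Here "a" is excluded because it already has a typed edge to the target, "b" falls below the threshold, and "c" remains as a candidate for a new edge or a possible duplicate.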