hypothesis
active
hypothesis:we-hypothesize-that-measuring-deviations-along-the-assistant-axis-can-predict-persona-drift-leading-to-harmful-or-bizarre-behaviors
We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors
Core predictive hypothesis linking activation representations to behavioral outcomes
Source paper
extracted_from · (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Findings (2)
finding
- Shows that deviation from Assistant persona predicts downstream harmful behavior
- Main quantitative result demonstrating effectiveness of activation capping
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- We hypothesize that the PC1 axis of role space measures deviation from the Assistant persona · hypothesis · 0.854 · Motivates computing the contrast vector as the formal Assistant Axis definition
- Causal interpretation linking Assistant Axis deviation to harmful behavior
- Identifies conversation domain as a key driver of persona drift
- Second of two central questions motivating the paper
- Proposed future application of the Assistant Axis
- Can off-the-rails model behavior be attributed to their persona drifting from the Assistant? · question · 0.778 · Motivates the multi-turn conversation drift experiments in §4
- Limitation acknowledgment about the adequacy of the linear representation assumption
- Key mechanistic claim about the developmental origin of the Assistant persona
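
Illustrative sketch: the hypothesis on this card, together with the related entry that describes the Assistant Axis as a contrast vector, can be made concrete with a toy projection. The snippet below is a minimal sketch under stated assumptions, not the source paper's implementation: the function names (`contrast_axis`, `projections`, `drift_flags`), the synthetic activations, the baseline, and the threshold value are all hypothetical and chosen only for illustration.

```python
import numpy as np


def contrast_axis(assistant_acts: np.ndarray, other_acts: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the mean of other_acts to the mean of assistant_acts.

    Assumption: the axis is a simple difference of mean activations.
    """
    diff = assistant_acts.mean(axis=0) - other_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def projections(acts: np.ndarray, axis: np.ndarray) -> np.ndarray:
    """Scalar projection of each activation row onto the unit axis."""
    return acts @ axis


def drift_flags(turn_acts: np.ndarray, axis: np.ndarray,
                baseline: float, threshold: float) -> np.ndarray:
    """True for turns whose projection falls more than `threshold` below `baseline`."""
    return (baseline - projections(turn_acts, axis)) > threshold


# Synthetic data (hypothetical): two clusters separated along one direction.
rng = np.random.default_rng(0)
dim = 16
direction = np.zeros(dim)
direction[0] = 1.0
assistant_acts = 0.1 * rng.normal(size=(200, dim)) + 2.0 * direction
other_acts = 0.1 * rng.normal(size=(200, dim)) - 2.0 * direction

axis = contrast_axis(assistant_acts, other_acts)
baseline = float(projections(assistant_acts, axis).mean())

# A toy conversation whose per-turn activation slides away from the Assistant cluster.
turn_acts = np.stack([(2.0 - 0.8 * t) * direction for t in range(6)])
flags = drift_flags(turn_acts, axis, baseline, threshold=2.0)
print(flags.tolist())
```

In this toy setup a turn is flagged once its projection drops far enough below the Assistant baseline. Whether such a flag actually predicts harmful or bizarre behavior is exactly what the hypothesis on this card asserts and what the linked findings address.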