claim
active
claim:the-assumption-that-the-assistant-persona-corresponds-to-a-linear-direction-in-activation-space-is-likely-flawed-some-information-may-be-represented-nonlinearly-or-encoded-in-weights-rather-than-activations
The assumption that the Assistant persona corresponds to a linear direction in activation space is likely flawed; some information may be represented nonlinearly or encoded in weights rather than activations
Limitation acknowledgment about the adequacy of the linear representation assumption
Source paper
extracted_from · (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Primary empirical claim of the paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- We hypothesize that the PC1 axis of role space measures deviation from the Assistant persona · hypothesis · 0.801 · Motivates computing the contrast vector as the formal Assistant Axis definition
- Features for consciousness, emotions, entrapment activate when asked about itself.
- Linear representation hypothesis: neural networks represent meaningful concepts as directions in their activation spaces. · hypothesis · 0.784 · Foundation for interpreting features as linear directions.
- Interpretive claim about how the Assistant persona is structured in activation space
- Key empirical result showing that optimizing for behavioral outputs and fitting representation geometry produce the same path in activation space.
- We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors · hypothesis · 0.772 · Core predictive hypothesis linking activation representations to behavioral outcomes
- Key mechanistic claim about persona dynamics
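
The "Related by similarity" filter above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `cosine` and `untyped_neighbors`, the entity keys, the two-dimensional toy embeddings, and the way typed edges are stored as a set of pairs. It is not the site's actual implementation, and the scores it produces are unrelated to the scores listed above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities similar to `target` that have no typed edge to it, best match first."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip anything already linked by a typed relation (in either direction).
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: keys and vectors are made up for this sketch.
embeddings = {
    "linear-assumption-flawed": [1.0, 0.0],
    "pc1-measures-deviation": [0.8, 0.6],   # similar, no typed edge: kept
    "primary-empirical-claim": [0.9, 0.1],  # similar, but has a typed edge: skipped
    "unrelated-entity": [0.0, 1.0],         # below the threshold: skipped
}
typed_edges = {("linear-assumption-flawed", "primary-empirical-claim")}

for name, score in untyped_neighbors("linear-assumption-flawed", embeddings, typed_edges):
    print(f"{name} · {score:.3f}")
```

Treating typed edges as undirected for the purpose of the filter is a design choice of this sketch; a graph that distinguishes edge direction might check only one orientation.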