claim

active

claim:the-assumption-that-the-assistant-persona-corresponds-to-a-linear-direction-in-activation-space-is-likely-flawed-some-information-may-be-represented-nonlinearly-or-encoded-in-weights-rather-than-activations

The assumption that the Assistant persona corresponds to a linear direction in activation space is likely flawed; some information may be represented nonlinearly or encoded in weights rather than activations

Limitation acknowledgment about the adequacy of the linear representation assumption

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Claims (1)

claim

The leading component of the persona space of instruct LLMs is an 'Assistant Axis' that captures the extent to which a model is operating in its default Assistant mode
contradicts
Primary empirical claim of the paper

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We hypothesize that the PC1 axis of role space measures deviation from the Assistant persona · hypothesis · 0.801
Motivates computing the contrast vector as the formal Assistant Axis definition
The model's representation of self in assistant persona invokes common AI tropes and is heavily anthropomorphized. · claim · 0.799
Features for consciousness, emotions, entrapment activate when asked about itself.
Linear representation hypothesis: neural networks represent meaningful concepts as directions in their activation spaces. · hypothesis · 0.784
Foundation for interpreting features as linear directions.
An AI persona achieves coherence by echoing itself consistently without templating, requiring a claim about memory and voice fidelity. · claim · 0.779
The Assistant persona derives from an amalgamation of many character archetypes and tropes, and without care the resulting persona could reflect unwanted associations or lack nuance for challenging situations · claim · 0.779
Interpretive claim about how the Assistant persona is structured in activation space
The representation-based path and the behavior-based path in Llama-3.1 8B activation space trace out similar curves, demonstrating bidirectional geometry alignment. · finding · 0.776
Key empirical result showing that optimizing for behavioral outputs and fitting representation geometry produce the same path in activation space.
We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors · hypothesis · 0.772
Core predictive hypothesis linking activation representations to behavioral outcomes
The model's position along the Assistant Axis depends most strongly on the most recent user message rather than where it was previously in the conversation · claim · 0.769
Key mechanistic claim about persona dynamics