thinker
active
thinker:jack-gallagher

Jack Gallagher

Authored: 1
Introduces: 0
Studies: 0
Affiliations: 1
Cited by: 1

Authored papers (1)

  • Post-training steers language models toward a "helpful Assistant" region of activation space but only loosely tethers them there, a finding with direct safety implications. Across Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B, PCA on activation vectors for 275 character archetypes reveals that the leading principal component (PC1, with pairwise role-loading correlations >0.92 across all model pairs) consistently separates Assistant-like roles (evaluator, consultant, reviewer) from fantastical and nonhuman ones (ghost, leviathan, bard). The paper introduces the Assistant Axis, a contrast vector between the mean default-Assistant activation and the mean of all fully role-playing vectors. The axis achieves cosine similarity >0.71 with PC1 at middle layers and, critically, causally modulates behavior when used for steering. Persona-based jailbreaks succeed at rates of 65.3%–88.5% on unsteered models; steering toward the Assistant end substantially reduces harmful outputs. Deviations along the Assistant Axis predict "persona drift," the tendency for models to slip into harmful or bizarre behavior during therapy-like conversations or philosophical discussions about AI self-awareness. Coding and writing tasks, by contrast, keep models near the Assistant end, and user-message embeddings predict subsequent Assistant Axis position with R² of 0.53–0.77. The paper's stabilization method, activation capping, clamps post-MLP residual stream projections along the Assistant Axis at the 25th-percentile threshold across 8 layers in Qwen (layers 46–53 of 64) and 16 layers in Llama (layers 56–71 of 80). It reduces harmful response rates by ~60% without degrading IFEval, MMLU Pro, GSM8K, or EQ-Bench performance. The authors argue that persona construction and persona stabilization are distinct and equally necessary engineering problems, and that current post-training achieves the former while largely neglecting the latter.
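
The two constructions named in the summary above, the contrast-vector axis and activation capping, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. The function names, array shapes, and synthetic data are hypothetical, and the capping direction is an assumption: projections that fall below the threshold are raised back to it, while everything orthogonal to the axis is left alone.

```python
import numpy as np


def assistant_axis(assistant_acts: np.ndarray, role_acts: np.ndarray) -> np.ndarray:
    """Unit contrast vector: mean Assistant activation minus mean role-play activation.

    Both inputs are (n_samples, d_model) arrays taken from a single layer.
    """
    axis = assistant_acts.mean(axis=0) - role_acts.mean(axis=0)
    return axis / np.linalg.norm(axis)


def cap_activations(hidden: np.ndarray, axis: np.ndarray, threshold: float) -> np.ndarray:
    """Raise any projection onto `axis` that is below `threshold` up to `threshold`.

    Assumption: capping acts as a floor on the Assistant-ward projection.
    Components orthogonal to the (unit-norm) axis are unchanged.
    """
    proj = hidden @ axis                          # (n_tokens,)
    deficit = np.minimum(proj - threshold, 0.0)   # negative where below the floor
    return hidden - deficit[:, None] * axis


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model = 16

    # Synthetic stand-ins for real residual-stream activations.
    assistant = rng.normal(size=(50, d_model)) + 2.0
    roles = rng.normal(size=(200, d_model))
    axis = assistant_axis(assistant, roles)

    # Threshold chosen as the 25th percentile of observed projections.
    hidden = rng.normal(size=(100, d_model))
    tau = float(np.percentile(hidden @ axis, 25))

    capped = cap_activations(hidden, axis, tau)
    print("min projection before:", float((hidden @ axis).min()))
    print("min projection after: ", float((capped @ axis).min()))
```

In a real model this update would be applied inside a forward hook on the chosen layers, which the sketch leaves out.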


Affiliations (1)

Co-authors (4)

Recent mentions (1)