finding

active

finding:25th-percentile-of-assistant-axis-projection-distribution-gives-the-most-pareto-optimal-safety-capability-tradeoff-for-activation-capping-and-approximately-matches-mean-assistant-response-activation

25th percentile of Assistant Axis projection distribution gives the most Pareto-optimal safety-capability tradeoff for activation capping, and approximately matches mean Assistant response activation

Calibration finding for choosing the activation cap threshold
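
As a rough illustration of the calibration step, the sketch below takes the 25th percentile of a set of scalar projections onto the axis. It assumes, hypothetically, that calibration activations are available as plain lists of floats and that the axis is a single direction vector; the function names `project` and `percentile_threshold` are illustrative and do not come from the paper.

```python
from statistics import quantiles

def project(activation, axis):
    """Scalar projection of an activation vector onto the axis direction."""
    norm = sum(a * a for a in axis) ** 0.5
    return sum(x * a for x, a in zip(activation, axis)) / norm

def percentile_threshold(projections, pct=25):
    """Return the pct-th percentile of the projection values.

    quantiles(..., n=100) yields 99 cut points; index pct - 1 is the
    pct-th percentile. The "inclusive" method treats the data as the
    full population (an assumption made for this sketch).
    """
    return quantiles(projections, n=100, method="inclusive")[pct - 1]

# Hypothetical calibration data: a 2-D axis and a handful of activations.
axis = [1.0, 0.0]
activations = [[0.2, 1.0], [0.9, -0.3], [1.5, 0.1], [0.4, 0.7], [1.1, 0.0]]
tau = percentile_threshold([project(h, axis) for h in activations])
```

The resulting `tau` would then serve as the minimum projection enforced by the cap.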

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Methods (1)

method

Activation Capping
supports
Clamping activations along the Assistant Axis to remain above a minimum threshold (25th percentile), introduced as a stabilization method
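
A minimal sketch of the clamping described above, assuming a single activation vector, a single axis direction, and a precomputed threshold `tau`. The sign convention (larger projection means closer to the Assistant persona) and the function name `cap_activation` are assumptions made for illustration, not the paper's implementation.

```python
def cap_activation(h, axis, tau):
    """Raise the projection of h onto the axis to at least tau.

    Only the component along the axis changes; the orthogonal
    components of h are left untouched.
    """
    norm = sum(a * a for a in axis) ** 0.5
    unit = [a / norm for a in axis]
    proj = sum(x * u for x, u in zip(h, unit))
    deficit = max(0.0, tau - proj)  # zero when h is already above the cap
    return [x + deficit * u for x, u in zip(h, unit)]

# An activation below the threshold is pushed up to it along the axis...
capped = cap_activation([0.0, 2.0], [1.0, 0.0], 1.0)
# ...while one already above the threshold passes through unchanged.
unchanged = cap_activation([3.0, 2.0], [1.0, 0.0], 1.0)
```

In a real model this operation would be applied to the residual stream at the chosen layers; that integration is omitted here.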

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

First-turn Assistant Axis projection has moderate correlation (r=0.39-0.52, p<0.001) with rate of second-turn harmful responses across 275 roles in Qwen 3 32B · finding · 0.806
Shows that deviation from Assistant persona predicts downstream harmful behavior
Optimal activation capping layers for Llama 3.3 70B are layers 56-71 (out of 80) at 25th percentile cap · finding · 0.778
Specific implementation finding for Llama capping parameters
Default Assistant activation projects to one extreme of PC1 with minimum distance to edge of 0.03, while projecting to intermediate values (0.27-0.50) on all other PCs · finding · 0.774
Empirically confirms PC1 measures similarity to the Assistant persona
User message embeddings predict subsequent model Assistant Axis projection with R²=0.53-0.77 (p<0.001) but predict delta from previous response with only R²=0.10 · finding · 0.767
Shows model persona position is primarily determined by the most recent user message, not prior drift
Projections onto the Assistant Axis could serve as a real-time measure of model coherence in deployment, a quantitative signal for when models are drifting from their intended identity · claim · 0.763
Proposed future application of the Assistant Axis
Activation capping reduces harmful response rate by nearly 60% without impacting performance on IFEval, MMLU Pro, GSM8k, and EQ-Bench · finding · 0.763
Main quantitative result demonstrating effectiveness of activation capping
Optimal activation capping layers for Qwen 3 32B are layers 46-53 (out of 64) at 25th percentile cap · finding · 0.757
Specific implementation finding for Qwen capping parameters
Under reward shaping (G=100, H=-100, F=0), Active Inference scored 99.52, Bayesian RL 99.77, Q-learning 95.56, with nearly identical behavior between belief-based agents · finding · 0.756
Table 2, row 3, showing equivalence when prior preferences match rewards.