finding
active
finding:25th-percentile-of-assistant-axis-projection-distribution-gives-the-most-pareto-optimal-safety-capability-tradeoff-for-activation-capping-and-approximately-matches-mean-assistant-response-activation
25th percentile of Assistant Axis projection distribution gives the most Pareto-optimal safety-capability tradeoff for activation capping, and approximately matches mean Assistant response activation
Calibration finding for choosing the activation cap threshold
Source paper
extracted_from · (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Methods (1)
method
- Activation Capping · supports · Clamping activations along the Assistant Axis to remain above a minimum threshold (25th percentile), introduced as a stabilization method
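
The capping rule described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the axis is a direction vector that is normalized before projecting, and that the threshold is the 25th percentile of scalar projections collected on some calibration set. The function names (`percentile_25`, `cap_activation`) and the toy numbers are hypothetical.

```python
import math
from statistics import quantiles


def percentile_25(projections):
    """Return the 25th percentile of scalar projections (inclusive method)."""
    return quantiles(projections, n=4, method="inclusive")[0]


def cap_activation(h, axis, threshold):
    """Raise the projection of h onto axis to at least threshold.

    Assumption: only the component of h along the (normalized) axis is
    changed; everything orthogonal to the axis is left untouched.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    proj = sum(x * u for x, u in zip(h, unit))
    # How far the projection falls below the floor (zero if already above it).
    shortfall = max(threshold - proj, 0.0)
    return [x + shortfall * u for x, u in zip(h, unit)]


# Toy calibration projections (hypothetical values, not from the paper).
calibration = [0.0, 1.0, 2.0, 3.0, 4.0]
tau = percentile_25(calibration)

axis = [1.0, 0.0]
low = cap_activation([0.5, 2.0], axis, tau)   # below the floor: gets raised
high = cap_activation([3.0, 2.0], axis, tau)  # above the floor: unchanged
```

In this sketch an activation whose projection already exceeds the threshold passes through unmodified, so the intervention only affects activations that have moved away from the high end of the axis.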
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows that deviation from Assistant persona predicts downstream harmful behavior
- Optimal activation capping layers for Llama 3.3 70B are layers 56-71 (out of 80) at 25th percentile cap · finding · 0.778 · Specific implementation finding for Llama capping parameters
- Empirically confirms PC1 measures similarity to the Assistant persona
- Shows model persona position is primarily determined by the most recent user message, not prior drift
- Proposed future application of the Assistant Axis
- Main quantitative result demonstrating effectiveness of activation capping
- Optimal activation capping layers for Qwen 3 32B are layers 46-53 (out of 64) at 25th percentile cap · finding · 0.757 · Specific implementation finding for Qwen capping parameters
- Table 2, row 3, showing equivalence when prior preferences match rewards.