finding

active

finding:activation-capping-reduces-harmful-response-rate-by-nearly-60-without-impacting-performance-on-ifeval-mmlu-pro-gsm8k-and-eq-bench

Activation capping reduces harmful response rate by nearly 60% without impacting performance on IFEval, MMLU Pro, GSM8k, and EQ-Bench

Main quantitative result demonstrating effectiveness of activation capping

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona
supports
Central interpretive claim and motivation for future work

Hypotheses (1)

hypothesis

We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors
supports
Core predictive hypothesis linking activation representations to behavioral outcomes

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Some activation capping settings slightly improve performance on IFEval, MMLU Pro, or GSM8k for both Qwen and Llama · finding · 0.833
Unexpected positive finding suggesting capping may sometimes help capabilities
25th percentile of Assistant Axis projection distribution gives the most Pareto-optimal safety-capability tradeoff for activation capping, and approximately matches mean Assistant response activation · finding · 0.763
Calibration finding for choosing the activation cap threshold
Optimal activation capping layers for Llama 3.3 70B are layers 56-71 (out of 80) at 25th percentile cap · finding · 0.749
Specific implementation finding for Llama capping parameters
Ali et al. 2025 found contrastive activation addition less effective at larger model scale, consistent with ESR in 70B · finding · 0.737
Prior finding from related work that aligns with ESR being strongest in the largest model tested
Activation Capping · method · 0.736
Clamping activations along the Assistant Axis to remain above a minimum threshold (25th percentile), introduced as a stabilization method
Activation steering toward deployment reduces type hint rate to <11% even with Wood Labs evaluation cue present · finding · 0.734
Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
The ease of suppressing reflection via activation steering raises security risks, as malicious actors could exploit reflection inhibition to bypass model safeguards. · claim · 0.733
Applied security implication derived from the asymmetry finding.
Ablating 26 OTD latents reduces multi-attempt rate by 25% (from 7.4% to 5.5%) in Llama-3.3-70B · finding · 0.730
Primary causal evidence for dedicated internal consistency-checking circuits
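
Illustrative sketch of the Activation Capping entry above, which describes clamping activations along the Assistant Axis so they stay above a minimum threshold set at the 25th percentile. The code below is one plausible reading of that description, not the paper's implementation: the additive correction along a unit-normalized axis, the inclusive quantile method, and all function names (`unit`, `projection`, `floor_threshold`, `cap_activation`) are assumptions made for illustration. It uses plain Python lists rather than any model framework.

```python
import math
import statistics


def unit(v):
    """Return v scaled to unit length (assumes v is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def projection(h, axis_unit):
    """Scalar projection of activation h onto a unit-length axis."""
    return sum(a * b for a, b in zip(h, axis_unit))


def floor_threshold(projections):
    """25th percentile of a sample of projections (assumed calibration step)."""
    return statistics.quantiles(projections, n=4, method="inclusive")[0]


def cap_activation(h, axis, tau):
    """Raise h's projection on the axis to at least tau.

    Assumption: only the component along the axis is changed, by adding
    the shortfall (tau - projection) times the unit axis. Activations
    already at or above tau are returned unchanged.
    """
    u = unit(axis)
    p = projection(h, u)
    if p >= tau:
        return list(h)
    shift = tau - p
    return [x + shift * ui for x, ui in zip(h, u)]


if __name__ == "__main__":
    axis = [3.0, 4.0]                      # hypothetical axis direction
    tau = floor_threshold([0.0, 1.0, 2.0, 3.0, 4.0])
    capped = cap_activation([0.0, 0.0], axis, tau)
    print(tau, capped, projection(capped, unit(axis)))
```

In a real model the same operation would presumably be applied to the residual-stream activations at a chosen range of layers, with the threshold calibrated on a sample of projections; those details are not specified on this page.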