finding
active
finding:activation-capping-reduces-harmful-response-rate-by-nearly-60-without-impacting-performance-on-ifeval-mmlu-pro-gsm8k-and-eq-bench
Activation capping reduces harmful response rate by nearly 60% without impacting performance on IFEval, MMLU Pro, GSM8k, and EQ-Bench
Main quantitative result demonstrating effectiveness of activation capping
Source paper
extracted_from (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central interpretive claim and motivation for future work
Hypotheses (1)
hypothesis
- Core predictive hypothesis linking activation representations to behavioral outcomes
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Unexpected positive finding suggesting capping may sometimes help capabilities
- Calibration finding for choosing the activation cap threshold
- Optimal activation capping layers for Llama 3.3 70B are layers 56-71 (out of 80) at 25th percentile cap · finding · 0.749 · Specific implementation finding for Llama capping parameters
- Prior finding from related work that aligns with ESR being strongest in the largest model tested
- Clamping activations along the Assistant Axis to remain above a minimum threshold (25th percentile), introduced as a stabilization method
- Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
- Applied security implication derived from the asymmetry finding.
- Ablating 26 OTD latents reduces multi-attempt rate by 25% (from 7.4% to 5.5%) in Llama-3.3-70B · finding · 0.730 · Primary causal evidence for dedicated internal consistency-checking circuits
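
One neighbor above describes activation capping as clamping activations along the Assistant Axis so that they stay above a minimum threshold set at the 25th percentile. The following is a minimal sketch of that idea only, not the source paper's implementation. The function names (`percentile`, `cap_activation`), the use of plain Python lists for hidden states, the linear-interpolation percentile, and the assumption that the axis is already available as a direction vector are all illustrative choices that do not come from the source.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def percentile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated q-th percentile, with 0 <= q <= 100.

    Hypothetical helper: used here to pick the cap threshold from
    projections measured on some calibration data.
    """
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


def cap_activation(
    hidden: Sequence[float], axis: Sequence[float], floor: float
) -> List[float]:
    """Clamp the projection of `hidden` onto `axis` to be at least `floor`.

    Only the component along the (normalized) axis is changed; the
    component orthogonal to the axis is left untouched.
    """
    norm = dot(axis, axis) ** 0.5
    unit = [x / norm for x in axis]
    proj = dot(hidden, unit)
    if proj >= floor:
        # Already above the threshold: return an unchanged copy.
        return list(hidden)
    shift = floor - proj
    return [h + shift * u for h, u in zip(hidden, unit)]
```

In this sketch, `floor` would be computed once as `percentile(calibration_projections, 25)` and then applied to the hidden state at each capped layer. How the axis is derived and where the clamp is hooked into the model are left open here.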