finding
active
finding:25th-percentile-of-assistant-axis-projection-distribution-gives-the-most-pareto-optimal-safety-capability-tradeoff-for-activation-capping-and-approximately-matches-mean-assistant-response-activation

25th percentile of Assistant Axis projection distribution gives the most Pareto-optimal safety-capability tradeoff for activation capping, and approximately matches mean Assistant response activation

Calibration finding for choosing the activation cap threshold

Source paper

extracted_from
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Methods (1)

method
  • Clamping activations along the Assistant Axis to remain above a minimum threshold (25th percentile), introduced as a stabilization method
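
The calibration and clamping described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (calibrate_floor, clamp_along_axis), the use of plain Python lists for activation vectors, and the choice of a calibration sample of projections are all assumptions made for illustration. The record only states that the threshold is the 25th percentile of the projection distribution and that activations are kept above it along the Assistant Axis.

```python
from statistics import quantiles


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def calibrate_floor(projections):
    """Return the 25th percentile of a sample of axis projections.

    Assumption: `projections` is a calibration sample of scalar projections
    onto the axis. method="inclusive" interpolates linearly within the sample.
    """
    return quantiles(projections, n=4, method="inclusive")[0]


def clamp_along_axis(activation, axis, floor):
    """Raise the component of `activation` along `axis` to at least `floor`.

    Components orthogonal to the axis are left unchanged. Hypothetical
    helper: where in the model this is applied is not specified here.
    """
    norm = dot(axis, axis) ** 0.5
    unit = [a / norm for a in axis]
    proj = dot(activation, unit)
    if proj >= floor:
        return list(activation)  # already above the floor, no change
    shift = floor - proj
    return [h + shift * u for h, u in zip(activation, unit)]


# Toy usage with made-up numbers (not data from the paper).
floor = calibrate_floor([0.0, 1.0, 2.0, 3.0, 4.0])
capped = clamp_along_axis([0.5, 3.0], axis=[2.0, 0.0], floor=floor)
untouched = clamp_along_axis([2.0, 3.0], axis=[2.0, 0.0], floor=floor)
```

Because only the component along the axis is shifted, an activation whose projection already exceeds the floor passes through unmodified, which is one plausible reading of why a lower percentile costs less capability than a higher one.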

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.