method
active
method:activation-capping
Activation Capping
Clamps activations along the Assistant Axis so that they remain above a minimum threshold (the 25th percentile); introduced as a stabilization method.
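A minimal sketch of the clamping step, assuming the threshold is the 25th percentile of projections onto the axis over some calibration set, and that activations are rows of a NumPy array. The function name `cap_along_axis`, the variable names, and the random data are hypothetical and for illustration only; in practice the operation would be applied to residual-stream activations at selected layers.

```python
import numpy as np

def cap_along_axis(acts, axis_vec, floor):
    """Raise each activation's projection onto axis_vec to at least `floor`.

    acts: (n, d) array of activations; axis_vec: (d,) direction; floor: scalar.
    Components orthogonal to the axis are left untouched.
    """
    unit = axis_vec / np.linalg.norm(axis_vec)
    proj = acts @ unit                        # scalar projection per row
    deficit = np.maximum(floor - proj, 0.0)   # zero where already above the floor
    return acts + deficit[:, None] * unit     # shift only along the axis

# Synthetic stand-ins for real activations (assumption: illustrative data only).
rng = np.random.default_rng(0)
calib = rng.normal(size=(1000, 16))
axis_vec = rng.normal(size=16)
unit = axis_vec / np.linalg.norm(axis_vec)

proj = calib @ unit
floor = np.percentile(proj, 25)               # assumed reading of "25th percentile"
capped = cap_along_axis(calib, axis_vec, floor)
```

Rows whose projection already meets the floor pass through unchanged, so the intervention only touches activations that have fallen below the threshold along the axis.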
Neighborhood — ranked by edge-count
Papers (1)
paper
Frameworks (1)
framework
- Assistant Axis (implements): Contrast vector between mean default Assistant activation and mean of all fully role-playing role vectors; main contribution of the paper
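The contrast-vector definition above can be sketched as a difference of means. The sign convention (Assistant mean minus role mean, so that larger projections are more Assistant-like) is an assumption chosen to match capping from below; `assistant_axis` and the toy arrays are hypothetical.

```python
import numpy as np

def assistant_axis(default_acts, role_vectors):
    """Difference between the mean default-Assistant activation and the
    mean of the role vectors. Sign convention is an assumption."""
    return default_acts.mean(axis=0) - role_vectors.mean(axis=0)

# Toy 2-D example (illustrative values, not real activations).
default_acts = np.array([[1.0, 0.0], [3.0, 0.0]])   # mean is [2, 0]
role_vectors = np.array([[0.0, 1.0], [0.0, 3.0]])   # mean is [0, 2]
axis_vec = assistant_axis(default_acts, role_vectors)
```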
Findings (3)
finding
- Calibration finding for choosing the activation cap threshold
- Specific implementation finding for Llama capping parameters
- Optimal activation capping layers for Qwen 3 32B are layers 46-53 (out of 64) at the 25th percentile cap (supports): Specific implementation finding for Qwen capping parameters
Concepts (1)
concept
- Persona drift (associated_with): Behavioural drift in multi-turn LLM interaction; documented in prior work for persona, identity, and instruction-following
Methods (1)
method
- Activation Steering (extends): Causal intervention technique: edit the NLA explanation, reconstruct via AR, and use the difference as a steering vector to manipulate model behavior.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Key capability: covariance pooling compresses gigabytes of activations into compact stable embeddings without large labeled datasets.
- Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
- Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
- How can activation capping or preventative steering be productionized for deployment at scale? (question, 0.773): Open engineering challenge identified in the future work section
- Technique of reading out model beliefs from internal activations before the final answer token is generated
- Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
- Latent model activations when processing inputs framed from another agent's perspective
- A lower-dimensional activation that is the only pathway for information between higher-dimensional activations; e.g. the residual stream between MLP layers