Activation Capping

Clamps activations along the Assistant Axis so that their projection stays above a minimum threshold (the 25th percentile); introduced as a stabilization method.
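A minimal sketch of the capping step, assuming that "along the Assistant Axis" means the scalar projection of each activation onto a unit-length axis vector, and that the 25th percentile is taken over projections from some reference set. The function name `cap_along_axis`, the reference set, and the random data are all hypothetical; the source does not specify them.

```python
import numpy as np

def cap_along_axis(acts, axis, reference_proj, percentile=25.0):
    """Raise each row's projection onto `axis` to at least a percentile floor.

    Assumption: the floor is the given percentile of `reference_proj`,
    a 1-D array of projections collected from a reference set.
    """
    unit = axis / np.linalg.norm(axis)                  # unit-length direction
    floor = np.percentile(reference_proj, percentile)   # minimum allowed projection
    proj = acts @ unit                                  # scalar projection per row
    deficit = np.maximum(floor - proj, 0.0)             # zero for rows already above the floor
    # Move only along the axis; the orthogonal component is left untouched.
    return acts + deficit[:, None] * unit, floor

# Illustrative random data (not from the source).
rng = np.random.default_rng(0)
axis = rng.normal(size=16)
unit = axis / np.linalg.norm(axis)
reference_proj = rng.normal(size=(200, 16)) @ unit
acts = rng.normal(size=(8, 16))
capped, floor = cap_along_axis(acts, axis, reference_proj)
```

Rows whose projection already exceeds the floor pass through unchanged, so the intervention only affects activations that have moved too far down the axis.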

Neighborhood — ranked by edge-count

paper

framework

Assistant Axis
implements
Contrast vector between mean default Assistant activation and mean of all fully role-playing role vectors; main contribution of the paper

finding

concept

Persona drift
associated_with
Behavioural drift in multi-turn LLM interaction; documented in prior work for persona, identity, and instruction-following

method

Activation Steering
extends
Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
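The Assistant Axis entry in this neighborhood describes a contrast vector between two means. A rough sketch of that computation follows, assuming the sign convention "Assistant mean minus role mean" (the source does not state the direction) and using tiny made-up arrays in place of real activations. The name `assistant_axis` is hypothetical.

```python
import numpy as np

def assistant_axis(assistant_acts, role_vectors):
    """Contrast vector: mean default-Assistant activation minus mean role vector.

    Assumption: rows are samples (or roles) and columns are hidden dimensions.
    """
    mean_assistant = assistant_acts.mean(axis=0)  # mean default Assistant activation
    mean_roles = role_vectors.mean(axis=0)        # mean over all role vectors
    return mean_assistant - mean_roles

# Toy inputs for illustration only.
assistant_acts = np.array([[1.0, 0.0, 2.0],
                           [3.0, 0.0, 2.0]])
role_vectors = np.array([[0.0, 1.0, 2.0],
                         [0.0, 3.0, 2.0],
                         [0.0, 2.0, 2.0]])
axis = assistant_axis(assistant_acts, role_vectors)
```

Dimensions on which the two groups agree (the third column here) cancel out, so the vector points only along the directions that separate the default Assistant from the role-playing personas.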

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation Compression · concept · 0.806
Key capability: covariance pooling compresses gigabytes of activations into compact stable embeddings without large labeled datasets.
Activation patching · method · 0.786
Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
Activations · concept · 0.786
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
How can activation capping or preventative steering be productionized for deployment at scale? · question · 0.773
Open engineering challenge identified in future work section
Activation Probing · concept · 0.767
Technique of reading out model beliefs from internal activations before the final answer token is generated
Activation Addition · method · 0.765
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
Other-Referencing Activations · concept · 0.755
Latent model activations when processing inputs framed from another agent's perspective
Bottleneck Activation · concept · 0.751
A lower-dimensional activation that is the only pathway for information between higher-dimensional activations; e.g. the residual stream between MLP layers
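For contrast with capping, the Activation Addition entry in the list above describes adding a direction vector to residual stream activations. A minimal sketch, assuming a `(batch, positions, hidden)` layout and a scalar strength `alpha`; both are illustrative assumptions, and the fixed `direction` here stands in for a learned one.

```python
import numpy as np

def add_direction(residual, direction, alpha=1.0):
    """Add `alpha * direction` to every position of a residual-stream tensor.

    Assumption: `residual` has shape (batch, positions, hidden) and
    `direction` has shape (hidden,), so NumPy broadcasting applies the
    same shift at every position.
    """
    return residual + alpha * direction

# Placeholder values, not a learned steering vector.
residual = np.zeros((2, 3, 4))
direction = np.array([1.0, 0.0, -1.0, 0.5])
steered = add_direction(residual, direction, alpha=2.0)
```

Unlike the capping operation described at the top of this page, this shift is applied unconditionally, regardless of where the activation already sits along the direction.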