method
active
method:activation-capping

Activation Capping

Clamps activations along the Assistant Axis so that they remain above a minimum threshold (the 25th percentile). Introduced as a stabilization method.
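
The clamping described above can be sketched as follows. This is a minimal illustration, not the source's implementation. It assumes that "along the Assistant Axis" means the scalar projection of an activation vector onto the unit-normalized axis, and that the 25th-percentile floor is estimated from projections of a reference set of activations. The names `cap_activation` and `percentile_floor` are hypothetical.

```python
import math
import statistics


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    return [x / norm for x in v]


def percentile_floor(reference_activations, axis):
    """Assumed threshold: first quartile of reference projections onto the axis."""
    u = normalize(axis)
    projections = [dot(h, u) for h in reference_activations]
    return statistics.quantiles(projections, n=4)[0]


def cap_activation(h, axis, floor):
    """Raise the projection of h onto the axis to at least `floor`.

    Components orthogonal to the axis are left unchanged; activations
    already at or above the floor are returned as-is.
    """
    u = normalize(axis)
    proj = dot(h, u)
    if proj >= floor:
        return list(h)
    shift = floor - proj
    return [x + shift * ui for x, ui in zip(h, u)]
```

Only the component along the axis is moved, so an activation that already projects above the floor passes through untouched.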

Neighborhood — ranked by edge-count

Frameworks (1)

framework
  • Assistant Axis
    implements
    Contrast vector between mean default Assistant activation and mean of all fully role-playing role vectors; main contribution of the paper
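
    A minimal sketch of the contrast vector described in this entry, assuming each activation is a plain list of floats and that the sign convention is Assistant mean minus role mean (so a larger projection reads as more Assistant-like). The function names are hypothetical and the sign convention is an assumption, not confirmed by this entry.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def assistant_axis(assistant_activations, role_vectors):
    """Contrast vector: mean Assistant activation minus mean role vector.

    The direction (Assistant minus roles) is an assumed convention.
    """
    mean_assistant = mean_vector(assistant_activations)
    mean_roles = mean_vector(role_vectors)
    return [a - r for a, r in zip(mean_assistant, mean_roles)]
```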

Concepts (1)

concept
  • Persona drift
    associated_with
    Behavioural drift in multi-turn LLM interaction; documented in prior work for persona, identity, and instruction-following

Methods (1)

method
  • Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Key capability: covariance pooling compresses gigabytes of activations into compact stable embeddings without large labeled datasets.
  • Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
  • Activations · concept · 0.786
    Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
  • Open engineering challenge identified in future work section
  • Activation Probing · concept · 0.767
    Technique of reading out model beliefs from internal activations before the final answer token is generated
  • Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
  • Latent model activations when processing inputs framed from another agent's perspective
  • A lower-dimensional activation that is the only pathway for information between higher-dimensional activations; e.g. the residual stream between MLP layers