Attention Block Output Activation

The specific activation representation used: the output of the ℓ-th attention block, defined as that block's MLP output plus the residual stream.
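The definition above can be read as an elementwise sum of two equal-width vectors at block ℓ. The following is a minimal sketch of that reading only, not the source's implementation: the names `block_output` and `toy_mlp` are hypothetical, the toy MLP is an arbitrary stand-in, and details such as layer normalization or what the MLP actually reads as input are omitted because the entry does not specify them.

```python
def block_output(residual, mlp_out):
    """Return the activation at one block: residual stream plus MLP output.

    Both arguments are plain lists of floats of equal width (assumed layout).
    """
    if len(residual) != len(mlp_out):
        raise ValueError("residual and mlp_out must have the same width")
    return [r + m for r, m in zip(residual, mlp_out)]


def toy_mlp(x):
    # Hypothetical stand-in for the block's MLP: elementwise ReLU scaled by 0.5.
    return [0.5 * max(0.0, v) for v in x]


# Toy residual stream for a single token position at block l.
residual = [1.0, -2.0, 4.0]
activation = block_output(residual, toy_mlp(residual))
print(activation)
```

In a real model the two terms would more likely be captured with framework hooks at the chosen block rather than recomputed by hand; the sketch only makes the sum in the definition explicit.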

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation Addition · method · 0.782
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior.
Activations · concept · 0.775
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
attention computation · concept · 0.768
Process using Q, K, V to compute a heat map over K and weighted sum of V.
attention mechanism · concept · 0.745
Core operation in transformers, computing weighted combinations of previous elements.
Activation decomposition · concept · 0.734
The conventional approach (e.g., SAEs, transcoders) of decomposing activations into interpretable features.
Activation Similarity · concept · 0.734
Model-independent feature comparison based on correlating activation vectors across a fixed diverse dataset.
Attention Schema · concept · 0.730
A predictive model representing and controlling attention; central to attention schema theory.
Activation Oracles (AO) · method · 0.729
Supervised method training models to answer questions about activations; NLAs differ by being unsupervised.