Activation Correlation

Pearson correlation of feature activations across 40M tokens, used to measure feature similarity and universality across models.
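
As a rough illustration of the idea, the sketch below correlates one feature's activations with several candidate features recorded on the same token positions, then picks the best match. The function names (`pearson`, `best_match`) and the toy activation values are hypothetical, and six tokens stand in for the 40M-token dataset. Returning 0.0 for a feature that never varies is a convention chosen here, not something the entry specifies.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length activation sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 entries")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A constant feature has no defined correlation; 0.0 is an assumed convention.
    return cov / norm if norm > 0 else 0.0

def best_match(feature, candidates):
    """Return (index, correlation) of the candidate most correlated with `feature`."""
    scores = [pearson(feature, c) for c in candidates]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]

# Toy activations over the same 6 token positions (hypothetical values).
feat_a = [0.0, 0.0, 1.2, 0.0, 2.4, 0.0]      # a feature from model A
feats_b = [
    [0.5, 0.4, 0.0, 0.6, 0.0, 0.5],          # fires where feat_a is silent
    [0.0, 0.0, 0.6, 0.0, 1.2, 0.0],          # scaled copy of feat_a
]
idx, r = best_match(feat_a, feats_b)
print(idx, round(r, 3))
```

Because Pearson correlation is invariant to scale and offset, the scaled copy is selected even though its raw magnitudes differ, which is what makes the measure usable across models whose features live on different scales.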

Neighborhood — ranked by edge-count

method

Computational Feature Proxy
associated_with
Log-likelihood ratio score estimating whether a token string belongs to a specific context (Arabic, DNA, base64); used to measure feature specificity and sensitivity
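
A minimal sketch of such a proxy, assuming smoothed character-unigram models for the target context and for a background distribution. The entry does not say how the likelihoods are estimated, so the models, the corpora, and the names `char_model` and `llr` are all illustrative assumptions.

```python
import math
from collections import Counter

def char_model(corpus, alphabet, alpha=1.0):
    """Add-alpha smoothed character-unigram distribution over `alphabet`."""
    counts = Counter(corpus)
    total = len(corpus) + alpha * len(alphabet)
    return {ch: (counts[ch] + alpha) / total for ch in alphabet}

def llr(text, context_model, background_model):
    """Sum over characters of log P(ch | context) - log P(ch | background)."""
    return sum(
        math.log(context_model[ch]) - math.log(background_model[ch])
        for ch in text
    )

# Tiny hypothetical corpora: DNA-like strings vs. generic English text.
dna_corpus = "ACGTTGCAAGCTTAGC" * 4
background_corpus = "the quick brown fox jumps over the lazy dog " * 4

# Shared alphabet so both models assign nonzero probability to every character.
alphabet = sorted(set(dna_corpus + background_corpus))
dna_model = char_model(dna_corpus, alphabet)
bg_model = char_model(background_corpus, alphabet)

dna_score = llr("GATTACA", dna_model, bg_model)    # positive: looks like DNA
text_score = llr("hello", dna_model, bg_model)     # negative: looks like background
print(dna_score > 0, text_score < 0)
```

A positive score marks the string as more likely under the target context than under the background. Thresholding the score gives a context label that feature activations can be compared against to estimate specificity and sensitivity.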

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation Similarity · concept · 0.807
Model-independent feature comparison based on correlating activation vectors across a fixed diverse dataset
Activations · concept · 0.796
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Activation Addition · method · 0.774
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
Other-Referencing Activations · concept · 0.773
Latent model activations when processing inputs framed from another agent's perspective
Activation Compression · concept · 0.768
Key capability: covariance pooling compresses gigabytes of activations into compact stable embeddings without large labeled datasets.
Statistical Activation Analysis · method · 0.766
Component of the contrastive retrieval pipeline analyzing activation statistics.
Causal Intervention via Activation Shift · method · 0.766
Intervening in model forward pass by adding/subtracting probe direction to group (b) hidden states to flip truth judgments
Activation patching · method · 0.763
Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.