Logit Weight Similarity

Correlating logit weight vectors between features from different models as a measure of downstream-effect universality
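
A minimal sketch of the comparison described above, assuming both models share a vocabulary so that each feature's logit weight vector has one entry per output token in the same order. Pearson correlation is used as the similarity measure. The helper names (`pearson`, `best_match_similarity`), the best-match aggregation, and the toy vectors are hypothetical illustrations, not taken from the source.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length logit weight vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must cover the same vocabulary")
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    du = [x - mean_u for x in u]
    dv = [y - mean_v for y in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    # A constant vector has no defined correlation; treat it as 0 here.
    return num / den if den else 0.0


def best_match_similarity(feature_a, features_b):
    """Assumed aggregation: highest correlation between one feature from
    model A and any feature from model B."""
    return max(pearson(feature_a, f) for f in features_b)


# Toy logit weights over a 4-token vocabulary (illustrative values only).
model_a_feature = [2.0, -1.0, 0.5, -1.5]
model_b_features = [
    [4.0, -2.0, 1.0, -3.0],  # scaled copy of the model A feature
    [-1.0, 2.0, 0.0, 1.0],   # pushes logits in roughly the opposite direction
]

score = best_match_similarity(model_a_feature, model_b_features)
print(f"best-match logit weight similarity: {score:.3f}")
```

Because correlation is invariant to scale and offset, two features that push the same tokens up and down score highly even if their logit weights differ in magnitude.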

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Logit Weight Analysis · method · 0.826
Computing each feature's linear effect on output token logits via path expansion through MLP output weights and unembedding matrix
Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance · finding · 0.742
Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects
Attribution Similarity · method · 0.739
Correlating attribution vectors (feature activation × logit weight of next token) across model pairs to measure functional universality
global logit shift · concept · 0.719
The methodological confound identified by this paper: injection biases the model toward 'YES' on any binary question, regardless of content
Self-Similarity · concept · 0.719
Structural and functional property exhibited by living systems but currently absent from most engineered machines.
Logit-based self-report · method · 0.716
Primary self-report measure: probability-weighted expected value over all ten digit-token logits, yielding a continuous rating that preserves full distributional signal
Functional Similarity · concept · 0.706
Similarity measured with respect to network behavior/function rather than statistical correlation of activations.
Logit Lens · method · 0.704
Unsupervised interpretability technique that projects activations through unembedding matrix; provides comparison point for NLA approach.