method
active
method:logit-weight-similarity

Logit Weight Similarity

Correlating logit weight vectors between features from different models as a measure of downstream-effect universality
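
A minimal sketch of the comparison described above, under stated assumptions: both models share one vocabulary, so token indices line up; a feature's logit weights are obtained by projecting its output direction through the unembedding matrix; and "correlating" is taken to mean Pearson correlation. The names logit_weights, pearson, unembed_a, unembed_b, feature_a and feature_b are hypothetical, and the matrices are toy values, not data from any model.

```python
import math


def logit_weights(output_dir, unembed):
    """Project a feature's output direction (length d_model) through an
    unembedding matrix (d_model rows x vocab columns), giving one weight per token."""
    if len(output_dir) != len(unembed):
        raise ValueError("output direction and unembedding disagree on d_model")
    vocab_size = len(unembed[0])
    return [
        sum(output_dir[i] * unembed[i][t] for i in range(len(output_dir)))
        for t in range(vocab_size)
    ]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (assumed similarity measure)."""
    if len(x) != len(y):
        raise ValueError("vectors must cover the same vocabulary")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return sum(a * b for a, b in zip(dx, dy)) / denom


# Toy setup (assumption): model A has d_model = 2, model B has d_model = 3,
# and both unembed into the same 4-token vocabulary.
unembed_a = [[1.0, 0.0, -1.0, 0.5],
             [0.0, 1.0, 0.5, -1.0]]
unembed_b = [[2.0, 0.0, -2.0, 1.0],
             [0.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.0]]

feature_a = [1.0, 0.0]        # hypothetical feature output direction in model A
feature_b = [1.0, 0.0, 0.0]   # hypothetical feature output direction in model B

similarity = pearson(logit_weights(feature_a, unembed_a),
                     logit_weights(feature_b, unembed_b))
print(similarity)
```

The two hidden sizes differ, but the comparison happens in vocabulary space, which is why features from different models can be correlated at all. Because Pearson correlation ignores scale, the two toy features count as matching even though one promotes the same tokens twice as strongly.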

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Computing each feature's linear effect on output token logits via path expansion through MLP output weights and unembedding matrix
  • Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects
  • Correlating attribution vectors (feature activation × logit weight of next token) across model pairs to measure functional universality
  • global logit shift · concept · 0.719
    The methodological confound identified by this paper: injection biases the model toward 'YES' on any binary question, regardless of content.
  • Self-Similarity · concept · 0.719
    Structural and functional property exhibited by living systems but currently absent from most engineered machines.
  • Primary self-report measure: probability-weighted expected value over all ten digit-token logits, yielding a continuous rating that preserves full distributional signal
  • Similarity measured with respect to network behavior/function rather than statistical correlation of activations.
  • Logit Lens · method · 0.704
    Unsupervised interpretability technique that projects activations through the unembedding matrix; provides a comparison point for the NLA approach.