method
active
method:logit-weight-analysis

Logit Weight Analysis

Computing each feature's linear effect on output token logits via path expansion through the MLP output weights and the unembedding matrix
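
A minimal sketch of the computation described above, assuming a feature is represented by a direction in MLP activation space and that the path from the MLP output to the logits is treated as purely linear (any final normalization and all later layers are ignored). The function name `logit_weights`, the matrix shapes, the random toy weights, and the `exclude` mask are illustrative assumptions, not details taken from this entry.

```python
import numpy as np

def logit_weights(decoder_dir, W_out, W_U, exclude=None):
    """Linear effect of one feature on every output-token logit.

    decoder_dir : (d_mlp,)          feature direction in MLP activation space (assumed)
    W_out       : (d_mlp, d_model)  MLP output weights
    W_U         : (d_model, vocab)  unembedding matrix
    exclude     : optional (vocab,) boolean mask of tokens to leave out,
                  e.g. anomalous tokens (hypothetical mask)
    """
    # Path expansion: feature -> residual stream -> logits.
    w = decoder_dir @ W_out @ W_U
    if exclude is not None:
        # Mark excluded tokens as NaN so they drop out of later rankings.
        w = np.where(exclude, np.nan, w)
    return w

# Toy dimensions and random weights, for illustration only.
rng = np.random.default_rng(0)
d_mlp, d_model, vocab = 8, 4, 10
decoder_dir = rng.normal(size=d_mlp)
W_out = rng.normal(size=(d_mlp, d_model))
W_U = rng.normal(size=(d_model, vocab))

w = logit_weights(decoder_dir, W_out, W_U)

# Hypothetical exclusion of tokens 0 and 3.
mask = np.zeros(vocab, dtype=bool)
mask[[0, 3]] = True
w_masked = logit_weights(decoder_dir, W_out, W_U, exclude=mask)

# Token indices whose logits the feature pushes up the most (NaNs sort last).
top_tokens = np.argsort(-w_masked)[:3]
```

Because the whole path is linear, the product `W_out @ W_U` could be computed once and reused across all features; the per-feature form is shown here only to keep the steps visible.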

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Concepts (1)

concept
  • Anomalous Tokens
    associated_with
    Extremely rare or never-used vocabulary elements that may distort logit weight analysis; excluded from feature analysis

Claims (1)

claim

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Correlating logit weight vectors between features from different models as a measure of downstream-effect universality
  • Primary self-report measure: probability-weighted expected value over all ten digit-token logits, yielding a continuous rating that preserves full distributional signal
  • Parameter-free loss transformation applied to each task loss to equalize scales
  • Logit Lens · method · 0.729
    Unsupervised interpretability technique that projects activations through unembedding matrix; provides comparison point for NLA approach.
  • Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication
  • Task weight · concept · 0.723
    Coefficient weighting each task loss in the MTL objective.
  • Logistic regression trained on GSM8k training set to predict answer correctness from projection features along reflection direction
  • Method used by Alexander personally for three whole nights to analyze the tracery truss of the Julian Street Inn dining hall.