method
active
method:logit-weight-analysis
Logit Weight Analysis
Computes each feature's linear effect on the output-token logits by path expansion through the MLP output weights and the unembedding matrix
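The path expansion described above can be sketched as a product of three matrices. This is a minimal illustration, not the source's implementation: it assumes feature directions live in MLP activation space and that the path from MLP output to logits is purely linear (any final normalization is ignored). The names `decoder`, `w_out`, `w_unembed`, the shapes, and the random weights are all hypothetical stand-ins.

```python
import numpy as np

def logit_weights(decoder, w_out, w_unembed):
    """Linear effect of each feature on each output-token logit.

    decoder:   (n_features, d_mlp)  feature directions in MLP activation space
    w_out:     (d_mlp, d_model)     MLP output weights
    w_unembed: (d_model, n_vocab)   unembedding matrix
    returns:   (n_features, n_vocab)
    """
    # Path expansion: feature direction -> residual stream -> logits.
    return decoder @ w_out @ w_unembed

# Hypothetical toy dimensions and random weights, for illustration only.
rng = np.random.default_rng(0)
n_features, d_mlp, d_model, n_vocab = 4, 8, 6, 10
decoder = rng.normal(size=(n_features, d_mlp))
w_out = rng.normal(size=(d_mlp, d_model))
w_unembed = rng.normal(size=(d_model, n_vocab))

weights = logit_weights(decoder, w_out, w_unembed)

# Indices of the three tokens each feature promotes most strongly.
top_tokens = np.argsort(-weights, axis=1)[:, :3]
```

Under these assumptions, row `i` of `weights` is the change in every token's logit per unit of activation of feature `i`, so sorting a row gives the tokens that feature most promotes or suppresses.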
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Prior Anthropic paper enabling circuit-level analysis of attention-only transformers; motivates current MLP decomposition
Concepts (1)
concept
- Anomalous Tokens · associated_with · Extremely rare or never-used vocabulary elements that may distort logit weight analysis; excluded from feature analysis
Claims (1)
claim
- Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Correlating logit weight vectors between features from different models as a measure of downstream-effect universality
- Primary self-report measure: probability-weighted expected value over all ten digit-token logits, yielding a continuous rating that preserves full distributional signal
- Parameter-free loss transformation applied to each task loss to equalize scales
- Unsupervised interpretability technique that projects activations through unembedding matrix; provides comparison point for NLA approach.
- Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication
- Coefficient weighting each task loss in the MTL objective.
- Logistic regression trained on GSM8k training set to predict answer correctness from projection features along reflection direction
- Method used by Alexander personally for three whole nights to analyze the tracery truss of the Julian Street Inn dining hall.