Logit Weight Analysis

Computing each feature's linear effect on output token logits via path expansion through the MLP output weights and the unembedding matrix
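
A minimal sketch of the path expansion described above, assuming features are decoder directions in MLP activation space and ignoring any final normalization layer. The names `decoder`, `w_out`, `w_unembed`, and `logit_weights` are hypothetical, and the matrices are random placeholders rather than real model weights.

```python
import numpy as np

def logit_weights(decoder, w_out, w_unembed):
    """Linear effect of each feature on each output token logit.

    decoder:   (n_features, d_mlp)  feature directions in MLP activation space (assumed)
    w_out:     (d_mlp, d_model)     MLP output weights
    w_unembed: (d_model, vocab)     unembedding matrix
    Returns an (n_features, vocab) matrix.
    """
    # Path expansion: feature -> MLP output -> residual stream -> logits.
    return decoder @ w_out @ w_unembed

# Random placeholder weights, for illustration only.
rng = np.random.default_rng(0)
n_features, d_mlp, d_model, vocab = 4, 8, 6, 10
decoder = rng.normal(size=(n_features, d_mlp))
w_out = rng.normal(size=(d_mlp, d_model))
w_unembed = rng.normal(size=(d_model, vocab))

weights = logit_weights(decoder, w_out, w_unembed)

# Indices of the three tokens each feature promotes most strongly.
top_tokens = np.argsort(-weights, axis=1)[:, :3]
```

Each row of `weights` is one feature's logit weight vector. In practice, columns for anomalous tokens would likely be dropped before ranking or comparing rows.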

Neighborhood — ranked by edge-count

framework

A Mathematical Framework for Transformer Circuits
extends
Prior Anthropic paper enabling circuit-level analysis of attention-only transformers; motivates current MLP decomposition

concept

Anomalous Tokens
associated_with
Extremely rare or never-used vocabulary elements that may distort logit weight analysis; excluded from feature analysis

claim

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Logit Weight Similarity · method · 0.826
Correlating logit weight vectors between features from different models as a measure of downstream-effect universality
Logit-based self-report · method · 0.757
Primary self-report measure: probability-weighted expected value over all ten digit-token logits, yielding a continuous rating that preserves full distributional signal
logarithm transformation · method · 0.735
Parameter-free loss transformation applied to each task loss to equalize scales
Logit Lens · method · 0.729
Unsupervised interpretability technique that projects activations through unembedding matrix; provides comparison point for NLA approach.
Logistic Regression Probe · method · 0.724
Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication
Task weight · concept · 0.723
Coefficient weighting each task loss in the MTL objective.
Logistic regression correctness probe · method · 0.717
Logistic regression trained on GSM8k training set to predict answer correctness from projection features along reflection direction
Computer Analysis of Stresses · method · 0.715
Method used by Alexander personally for three whole nights to analyze the tracery truss of the Julian Street Inn dining hall.