finding
active
finding:automated-logit-weight-prediction-achieves-74-mean-accuracy-for-features-vs-58-for-neurons-vs-50-chance
Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance
Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Logit lens prediction accuracy is near-chance at layer 4 (28%) after injection at L2, α=6 · finding · 0.799 · Shows that signal integration into explicit prediction has barely begun immediately after injection
- Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
- Quantifies the information gain from using logit-based expected value over greedy or sampled decoding
- Primary negative finding reinterpreted as methodological claim: binary paradigm is invalid for testing introspection
- Shows that loss recovery increases with autoencoder size
- Quantitative relationship between concept frequency and feature presence.
- Demonstrates that activation similarity can diverge from logit weight similarity due to interference
- Table 2, row 3, showing equivalence when prior preferences match rewards.