finding
active
finding:higher-activating-feature-intervals-are-systematically-more-interpretable-than-lower-activating-intervals-in-human-analysis
Higher-activating feature intervals are systematically more interpretable than lower-activating intervals in human analysis
Shows that interpretability correlates with activation strength and that most of the model effect comes from high activations
Source paper
extracted_from (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Motivates shift from studying model activations ('thoughts') to understanding parameters ('the computations themselves').
- Author's interpretation of the negative correlation between reflection rate and accuracy observed in Fig. 5
- Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
- Quantitative comparison supporting SAE utility.
- Supports that persistence is genuinely tied to emotion structure rather than measurement artifact
- SAEs uncover safety-relevant representations that might be monitored or controlled.
- Quantitative relationship between concept frequency and feature presence.
- Claude 3 Opus ratings aligned with human judgment of feature descriptions.