finding

active

finding:median-feature-interval-scored-12-14-on-interpretability-rubric-vs-median-neuron-score-of-0

Median feature interval scored 12/14 on interpretability rubric vs median neuron score of 0

Human analysis showing features are substantially more interpretable than neurons

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence
supports
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
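A minimal sketch of how a "related by similarity" list like the one below could be assembled, assuming each entity has an embedding vector and that typed relations are stored as ID pairs. The function names, the `embeddings` and `typed_edges` structures, and the sample data are hypothetical; only the 0.65 cosine threshold and the "no typed edge" condition come from this page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with cosine >= threshold and no typed
    edge to the target, sorted by descending score.

    embeddings:  dict mapping entity id -> list of floats (hypothetical layout)
    typed_edges: set of (id_a, id_b) pairs, checked in both directions
    """
    target_vec = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities that already have a typed relation to the target.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((other_id, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results

# Hypothetical toy data: "d" is similar but already linked by a typed edge.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.0]}
typed_edges = {("a", "d")}
neighbors = related_by_similarity("a", embeddings, typed_edges)
```

Checking typed edges in both directions is a design assumption; a graph with directed relations might only exclude one direction.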

Features in A/1 have median activation correlation of 0.72 with most similar feature in B/1; neurons have median 0.46 · finding · 0.780
Systematic comparison showing features are substantially more universal than neurons across models
82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance · finding · 0.760
SAE features are not simply mirroring individual neurons.
Higher-activating feature intervals are systematically more interpretable than lower-activating intervals in human analysis · finding · 0.747
Shows that interpretability correlates with activation strength; most of the effect on the model comes from high activations
Feature Interpretability Rubric · method · 0.740
14-point scoring rubric for human evaluation of feature interpretability covering confidence, activation consistency, logit consistency, and specificity
Automated interpretability (Claude 3 Opus) and specificity scoring show SAE features are significantly more interpretable and specific than MLP neurons · finding · 0.739
Quantitative comparison supporting SAE utility.
Arabic feature A/1/3450 has 27 neurons with coefficient magnitude ≥0.1, and the three largest coefficients are negative; the most correlated neuron responds to a mixture of non-English languages · finding · 0.737
Demonstrates that the Arabic feature is not aligned to any single neuron
Our SAEs' features are more interpretable than neurons · claim · 0.726
Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
Most correlated neuron A/neurons/470 has a correlation of only 0.18 with base64 feature A/1/2357 and responds to code, HTML labels, and URLs · finding · 0.722
Shows that the neuron closest to the base64 feature is polysemantic, while the learned feature is monosemantic
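Several of the findings above compare a feature to its most correlated neuron. The sketch below shows one way such a statistic could be computed, assuming feature and neuron activations are recorded over the same sequence of tokens. The helper names and the toy activations are hypothetical and are not the source paper's code; only the use of Pearson correlation and the 0.3 cutoff come from the entries above.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def max_neuron_correlation(feature_acts, neuron_acts):
    """Highest Pearson correlation between one feature and any neuron."""
    return max(pearson(feature_acts, acts) for acts in neuron_acts)

def fraction_at_most(feature_acts_list, neuron_acts, cutoff=0.3):
    """Fraction of features whose best-matching neuron correlates at or below cutoff."""
    below = sum(
        1 for acts in feature_acts_list
        if max_neuron_correlation(acts, neuron_acts) <= cutoff
    )
    return below / len(feature_acts_list)

# Hypothetical activations over four tokens.
neurons = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
aligned_feature = [0.0, 1.0, 0.0, 1.0]    # mirrors the first neuron exactly
unaligned_feature = [1.0, 1.0, 0.0, 0.0]  # uncorrelated with both neurons
share = fraction_at_most([aligned_feature, unaligned_feature], neurons)
```

A low maximum correlation by itself only indicates that no single neuron tracks the feature; the entries above pair it with manual inspection before concluding that features do not mirror neurons.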