finding

active

finding:claude-achieves-significantly-higher-spearman-correlation-predicting-feature-activations-vs-neuron-activations

Claude achieves significantly higher Spearman correlation predicting feature activations vs neuron activations

Automated interpretability analysis of activations confirms features are more interpretable than neurons
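As a rough illustration of the metric named in this finding, the sketch below computes a Spearman rank correlation between predicted and observed activations. It assumes the score compares a model's predicted activation levels against the true activations over the same tokens; the function names and the sample numbers are hypothetical and are not taken from the source paper.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return pearson(average_ranks(predicted), average_ranks(actual))


# Hypothetical example: predicted activation levels vs. observed activations.
predicted = [0, 1, 3, 9, 2]
actual = [0.0, 0.4, 2.1, 7.5, 0.9]
score = spearman(predicted, actual)
```

Because Spearman correlation depends only on rank order, a predictor is rewarded for ordering tokens correctly by activation strength even when its predicted magnitudes are on a different scale.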

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence
supports
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Features in A/1 have a median activation correlation of 0.72 with the most similar feature in B/1; neurons have a median of 0.46 · finding · 0.803
Systematic comparison showing features are substantially more universal than neurons across models
Most correlated neuron A/neurons/470 has a correlation of only 0.18 with base64 feature A/1/2357 and responds to code, HTML labels, and URLs · finding · 0.802
Shows base64 feature is polysemantic at neuron level but monosemantic as learned feature
Spearman's rank correlation among different alignment metrics (CKA, SVCCA, Mutual k-NN, CKNNA) over 78 vision models is high across variants, with all p-values below 2.24×10^-105 · finding · 0.780
Validates robustness of alignment metric choice
Models with 1-hot activation sparsity still have polysemantic neurons; a single neuron trained on 4 mutually exclusive features prefers a polysemantic representation with loss ~0.7 vs 0.8 · finding · 0.766
Counter-example disproving that architectural sparsity alone can prevent polysemanticity
82% of features in the 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance · finding · 0.760
SAE features are not simply mirroring individual neurons.
Feature pair A/1/3949 and B/1/3321 have activation correlation 0.98 but negative logit weight correlation, firing on PLOSOne journal abbreviations · finding · 0.754
Demonstrates that activation similarity can diverge from logit weight similarity due to interference
We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks · hypothesis · 0.753
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
Simulated electrophysiological responses show onset of discriminatory neural activity much earlier after rule learning than before, due solely to learned likelihood mappings enabling retrospective inference · finding · 0.750
Predicted neural signature of insight: reduced ERP latency and increased early amplitude
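
The "Related by similarity" list above is described as entities with cosine ≥ 0.65 and no typed edge. A minimal sketch of such a filter follows, assuming each entity has an embedding vector and that typed neighbors are tracked by ID; the function names, the data layout, and the toy vectors are hypothetical and do not describe how this page was actually generated.

```python
import math
from typing import Mapping, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_untyped_neighbors(
    query: Sequence[float],
    candidates: Mapping[str, Sequence[float]],
    typed_ids: set[str],
    threshold: float = 0.65,
) -> list[tuple[str, float]]:
    """Return (id, score) pairs at or above the threshold, highest score first,
    skipping entities that already have a typed edge to the query entity."""
    scored = [
        (entity_id, cosine_similarity(query, vector))
        for entity_id, vector in candidates.items()
        if entity_id not in typed_ids
    ]
    kept = [(entity_id, s) for entity_id, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; "d" already has a typed edge and is excluded.
neighbors = similar_untyped_neighbors(
    query=[1.0, 0.0],
    candidates={"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [1.0, 0.1]},
    typed_ids={"d"},
)
```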