finding

active

finding:higher-activating-feature-intervals-are-systematically-more-interpretable-than-lower-activating-intervals-in-human-analysis

Higher-activating feature intervals are systematically more interpretable than lower-activating intervals in human analysis

Shows that interpretability correlates with activation strength and that most of the effect on the model comes from high activations

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence
supports
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation-based interpretability does not immediately explain the computations that gave rise to activations; understanding parameters is necessary for deeper insight · claim · 0.782
Motivates shift from studying model activations ('thoughts') to understanding parameters ('the computations themselves').
Higher reflection frequency correlates with lower accuracy partly because more reflections are generated on difficult questions · claim · 0.774
Author's interpretation of the negative correlation between reflection rate and accuracy observed in Fig. 5
Our SAEs' features are more interpretable than neurons. · claim · 0.769
Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
Automated interpretability (Claude 3 Opus) and specificity scoring show SAE features are significantly more interpretable and specific than MLP neurons. · finding · 0.769
Quantitative comparison supporting SAE utility.
Lower (more central) PCs of emotion feature activations are more persistent than higher-rank (noisier) PCs in both Kimi and Cogito, above variance-matched baselines. · finding · 0.769
Supports that persistence is genuinely tied to emotion structure rather than measurement artifact
We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content. · claim · 0.765
SAEs uncover safety-relevant representations that might be monitored or controlled.
The likelihood of a dedicated feature for a concept (element, city, animal, food) follows a sigmoid in log-frequency of the concept in training data, with threshold frequency inversely proportional to number of alive features. · finding · 0.762
Quantitative relationship between concept frequency and feature presence.
Automated interpretability using LLMs can usefully score feature specificity. · claim · 0.760
Claude 3 Opus ratings aligned with human judgment of feature descriptions.