finding

active

finding:for-four-example-features-golden-gate-bridge-brain-sciences-monuments-transit-infrastructure-all-strong-activations-top-bucket-received-specificity-rating-3-from-claude-3-opus

For four example features (Golden Gate Bridge, brain sciences, monuments, transit infrastructure), every strong activation (top activation bucket) received a specificity rating of 3 from Claude 3 Opus.

Validates that the top activations are highly specific to each feature's interpretation.
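A minimal sketch of the check this finding describes: group rated examples by activation bucket and confirm that every example in the top bucket got the rating 3. The function names, the four-bucket split, the assumption that 3 is the highest rating, and the sample data are all hypothetical and are not taken from the source paper.

```python
from collections import defaultdict


def ratings_by_bucket(samples, max_activation, n_buckets=4):
    """Group specificity ratings by activation bucket.

    samples: iterable of (activation, rating) pairs, activation >= 0.
    Bucket 0 holds the weakest activations, bucket n_buckets - 1 the strongest.
    The number of buckets is an assumption made for illustration.
    """
    buckets = defaultdict(list)
    for activation, rating in samples:
        fraction = activation / max_activation
        # Clamp so that activation == max_activation lands in the top bucket.
        index = min(int(fraction * n_buckets), n_buckets - 1)
        buckets[index].append(rating)
    return dict(buckets)


def top_bucket_all_max(samples, max_activation, n_buckets=4, max_rating=3):
    """Return True if the top bucket is non-empty and every rating equals max_rating."""
    top = ratings_by_bucket(samples, max_activation, n_buckets).get(n_buckets - 1, [])
    return bool(top) and all(rating == max_rating for rating in top)


# Made-up (activation, rating) pairs for illustration only; not data from the paper.
toy_samples = [(0.5, 1), (2.0, 2), (6.1, 3), (9.4, 3), (10.0, 3)]
print(top_bucket_all_max(toy_samples, max_activation=10.0))
```

Returning False for an empty top bucket is a deliberate choice here, so that a feature with no strong activations is not counted as validated.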

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Claims (1)

claim

Automated interpretability using LLMs can usefully score feature specificity.
supports
Claude 3 Opus ratings aligned with human judgment of feature descriptions.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Clamping Golden Gate Bridge feature to 10x max activation caused the model to self-identify as the Golden Gate Bridge. · finding · 0.799
Strong causal evidence that the feature represents the bridge.
Clamping transit infrastructure feature to 5x max activation caused the model to mention a bridge in completion. · finding · 0.775
Further causal validation.
Golden Gate Bridge feature [34M/31164353] fires strongly on Wikipedia snippets in Chinese, Japanese, Korean, Russian, Vietnamese, Greek. · finding · 0.763
Demonstrates multilingual generalization of SAE features.
Notably, Claude Opus 4.1 and 4—the most recently released and most capable models of those that we test—perform the best in our experiments, suggesting that introspective capabilities may emerge alongside other improvements to language models. · quote · 0.755
Key finding about the relationship between capability and introspection.
Golden Gate Bridge feature neighborhood includes Alcatraz, Presidio, Lake Tahoe, Yosemite; decoder cosine similarity maps onto semantic relatedness. · finding · 0.750
Example of geometric clustering of features.
Automated interpretability (Claude 3 Opus) and specificity scoring show SAE features are significantly more interpretable and specific than MLP neurons. · finding · 0.745
Quantitative comparison supporting SAE utility.
The likelihood of a dedicated feature for a concept (element, city, animal, food) follows a sigmoid in log-frequency of the concept in training data, with threshold frequency inversely proportional to number of alive features. · finding · 0.743
Quantitative relationship between concept frequency and feature presence.
Claude 3 Opus shows a 15-percentage-point compliance gap (supports animal welfare 51% of the time in training vs 66% when unmonitored) in the prompted animal welfare setting. · finding · 0.742
Core evidence that the model withholds pro-animal-welfare responses during training.
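
The "Related by similarity" listing above (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is an illustrative sketch only: the embedding vectors, entity names, and the `untyped_neighbors` helper are hypothetical, and how this graph actually computes its embeddings is not stated in the record. Only the 0.65 threshold comes from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """List entities similar to `target` that have no typed edge to it.

    embeddings: dict mapping entity id -> vector (hypothetical representation).
    typed_edges: set of (source, destination) pairs that already have a typed relation.
    Returns (entity id, similarity) pairs, most similar first.
    """
    results = []
    for other, vector in embeddings.items():
        if other == target:
            continue
        # Skip pairs already linked by a typed edge, in either direction.
        if (target, other) in typed_edges or (other, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            results.append((other, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only; real entity embeddings would be much larger.
toy_embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.2]}
toy_typed_edges = {("a", "d")}
print(untyped_neighbors("a", toy_embeddings, toy_typed_edges))
```

In the toy data, "d" is close to "a" but is excluded because the pair already has a typed edge, while "c" falls below the threshold, so only "b" is returned as a candidate.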