finding
active
finding:for-four-example-features-golden-gate-bridge-brain-sciences-monuments-transit-infrastructure-all-strong-activations-top-bucket-received-specificity-rating-3-from-claude-3-opusFor four example features (Golden Gate Bridge, brain sciences, monuments, transit infrastructure), all strong activations (top bucket) received specificity rating 3 from Claude 3 Opus.
Validation that top activations are highly specific to interpretation.
Source paper
extracted_fromNeighborhood — ranked by edge-count
Claims (1)
claim
- Claude 3 Opus ratings aligned with human judgment of feature descriptions.
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Strong causal evidence that the feature represents the bridge.
- Further causal validation.
- Demonstrates multilingual generalization of SAE features.
- Key finding about the relationship between capability and introspection.
- Example of geometric clustering of features.
- Quantitative comparison supporting SAE utility.
- Quantitative relationship between concept frequency and feature presence.
- Core evidence that model withholds pro-animal-welfare responses during training