finding
active
finding:clamping-golden-gate-bridge-feature-to-10x-max-activation-caused-the-model-to-self-identify-as-the-golden-gate-bridge
Clamping the Golden Gate Bridge feature to 10x its maximum activation caused the model to self-identify as the Golden Gate Bridge.
Strong causal evidence that the feature represents the bridge.
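A minimal sketch of what such a clamping intervention can look like, assuming the feature comes from a sparse autoencoder (SAE) over model activations. The weights, dimensions, feature index, and function names below are all hypothetical stand-ins, not the source paper's code. The idea shown: encode the activation, overwrite one feature with 10x its observed maximum, decode, and add back the SAE's reconstruction error so that only the chosen feature direction changes.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc, b_dec):
    """ReLU encoder of a toy SAE (hypothetical parameterization)."""
    return np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)

def clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx, clamp_value):
    """Return activations with one SAE feature forced to clamp_value."""
    f = sae_encode(x, W_enc, b_enc, b_dec)
    recon = f @ W_dec + b_dec
    error = x - recon                      # part of x the SAE fails to explain
    f_clamped = f.copy()
    f_clamped[..., feature_idx] = clamp_value
    # Decode the edited features and restore the unexplained residual.
    return f_clamped @ W_dec + b_dec + error

# Toy, randomly initialized SAE: an assumption for illustration only.
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
b_dec = np.zeros(d_model)

acts = rng.normal(size=(100, d_model))     # stand-in for recorded activations
feature_idx = 3                            # hypothetical index of the feature
max_act = sae_encode(acts, W_enc, b_enc, b_dec)[:, feature_idx].max()

x = acts[:5]
steered = clamp_feature(x, W_enc, b_enc, W_dec, b_dec,
                        feature_idx, 10.0 * max_act)
```

In a real experiment the steered activations would be written back into the model's forward pass at the layer the SAE was trained on. That hook is omitted here because it depends on the model and framework.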
Source paper
extracted_from
Neighborhood, ranked by edge count
Claims (1)
claim
- Clamping feature activations causally alters model behavior in interpretable ways.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Further causal validation.
- Validation that top activations are highly specific to interpretation.
- Demonstrates multilingual generalization of SAE features.
- Feature manipulation alters persona.
- Causal effect: activates generation of security bugs.
- Causal effect: feature induces perception of bugs.
- Feature steers model toward gender-stereotypical completions.
- Feature intervention eliminates untruthful answer.