finding
active
Clamping addition feature active on non-addition code tricks the model into believing it has been asked to execute addition.
Causal effect showing the feature governs computation.
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Suppressing the feature makes the model ignore bugs.
- Causal effect: feature induces perception of bugs.
- Feature manipulation alters persona.
- Causal effect: activates generation of security bugs.
- Feature represents the 'addition' function abstractly.
- Modifying model behavior by clamping SAE feature activations to specific values during forward pass.
- Feature intervention eliminates untruthful answer.
- The specific computational question the paper resolves empirically
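
The clamping intervention this finding relies on (setting one SAE feature's activation to a chosen value during the forward pass) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the source paper's implementation: the toy 2-dimensional weights, the names `encode` and `clamp_feature`, and the choice to patch the activation vector by adding the feature's decoder direction scaled by the change in activation.

```python
from typing import List

Vector = List[float]


def encode(x: Vector, w_enc: List[Vector], b_enc: Vector) -> Vector:
    """Toy SAE encoder: one ReLU feature activation per encoder row."""
    acts = []
    for row, bias in zip(w_enc, b_enc):
        pre = sum(w * xi for w, xi in zip(row, x)) + bias
        acts.append(max(pre, 0.0))
    return acts


def clamp_feature(
    x: Vector,
    w_enc: List[Vector],
    b_enc: Vector,
    w_dec: List[Vector],
    feature_idx: int,
    value: float,
) -> Vector:
    """Return a patched copy of x in which one feature is clamped to `value`.

    The patch adds (value - current activation) times the feature's decoder
    direction, leaving the rest of the activation vector untouched.
    """
    acts = encode(x, w_enc, b_enc)
    delta = value - acts[feature_idx]
    direction = w_dec[feature_idx]
    return [xi + delta * di for xi, di in zip(x, direction)]


# Hypothetical toy weights: feature 0 stands in for the "addition" feature.
W_ENC = [[1.0, 0.0], [0.0, 1.0]]
B_ENC = [0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0]]

x = [0.5, 2.0]  # activation on "non-addition" input
patched = clamp_feature(x, W_ENC, B_ENC, W_DEC, feature_idx=0, value=5.0)
```

In a real model the patched vector would replace the original activation at the hooked layer, and the downstream output would be compared against the unpatched run to measure the causal effect.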