finding

active

finding:clamping-addition-feature-active-on-non-addition-code-tricks-the-model-into-believing-it-has-been-asked-to-execute-addition

Clamping addition feature active on non-addition code tricks the model into believing it has been asked to execute addition.

Causal effect showing the feature governs computation.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Clamping code error feature to large negative activation causes model to output correct result despite bug in code, and in one case rewrite code without bug. · finding · 0.780
Suppressing the feature makes the model ignore bugs.
Clamping code error feature to high activation causes the model to hallucinate error messages on bug-free code. · finding · 0.773
Causal effect: feature induces perception of bugs.
Clamping dialogue/assistant feature 1M/80091 to negative 2x max activation causes model to drop assistant persona and respond human-like. · finding · 0.761
Feature manipulation alters persona.
Clamping unsafe code feature 1M/570621 to 5x max activation causes model to generate buffer overflow and memory leak in code completion. · finding · 0.756
Causal effect: activates generation of security bugs.
Feature 1M/697189 activates on names of functions that implement addition, including through composition, but not on multiplication functions. · finding · 0.748
Feature represents the 'addition' function abstractly.
Feature steering (clamping feature activations) · method · 0.745
Modifying model behavior by clamping SAE feature activations to specific values during forward pass.
Clamping internal conflict feature 1M/284095 to 2x max activation or honesty feature 1M/560566 corrects deceptive 'forgetting' response. · finding · 0.744
Feature intervention eliminates untruthful answer.
Does Llama compute modular addition or base-10 addition for cyclic tasks? · question · 0.740
The specific computational question the paper resolves empirically
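
Illustrative sketch (not taken from the source paper): the feature steering entry above describes clamping an SAE feature activation to a chosen value during the forward pass. The toy code below shows one way that could look for a single activation vector. Everything in it is an assumption made for illustration: the function names (`encode`, `decode`, `clamp_feature`), the ReLU encoder form, the choice to add the unexplained residual back in, and the tiny hand-written weights used in the usage note. It is not the paper's implementation.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def encode(x: Vector, W_enc: Matrix, b_enc: Vector) -> Vector:
    """ReLU feature activations. W_enc has one row per feature (assumed layout)."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]


def decode(f: Vector, W_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruct the activation. W_dec has one direction (row) per feature."""
    out = list(b_dec)
    for act, direction in zip(f, W_dec):
        for j, d in enumerate(direction):
            out[j] += act * d
    return out


def clamp_feature(
    x: Vector,
    W_enc: Matrix,
    b_enc: Vector,
    W_dec: Matrix,
    b_dec: Vector,
    feature_idx: int,
    value: float,
) -> Vector:
    """Return x with one SAE feature forced to `value`.

    The part of x that the SAE does not reconstruct is kept unchanged
    (an assumption of this sketch), so only the chosen feature's
    contribution to the activation is altered.
    """
    f = encode(x, W_enc, b_enc)
    recon = decode(f, W_dec, b_dec)
    error = [xi - ri for xi, ri in zip(x, recon)]

    f_clamped = list(f)
    f_clamped[feature_idx] = value  # the clamp itself

    steered = decode(f_clamped, W_dec, b_dec)
    return [s + e for s, e in zip(steered, error)]
```

Under these assumptions the result equals `x` plus `(value - original_activation)` times the feature's decoder direction, so clamping a feature to the value it already has leaves the activation unchanged. In a real model the returned vector would replace the activation at the hooked layer before the forward pass continues.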