finding

active

finding:clamping-code-error-feature-to-large-negative-activation-causes-model-to-output-correct-result-despite-bug-in-code-and-in-one-case-rewrite-code-without-bug

Clamping code error feature to large negative activation causes model to output correct result despite bug in code, and in one case rewrite code without bug.

Suppressing the feature makes the model ignore bugs.
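
To make the intervention concrete, here is a minimal sketch of clamping one feature, assuming a sparse-autoencoder setup in which each feature has a decoder direction and the residual vector is edited directly. The function name `clamp_feature`, the toy vectors, and the clamp value are hypothetical illustrations, not taken from the source paper.

```python
from typing import List


def clamp_feature(
    residual: List[float],
    activations: List[float],
    decoder: List[List[float]],
    index: int,
    value: float,
) -> List[float]:
    """Return a copy of `residual` with feature `index` clamped to `value`.

    Assumption: decoder[i] is the direction feature i writes into the
    residual vector. Shifting the residual by (value - current activation)
    along that direction is the same as rebuilding it from the edited
    activations while leaving the reconstruction error untouched.
    """
    delta = value - activations[index]
    return [r + delta * d for r, d in zip(residual, decoder[index])]


# Toy example: two features in a three-dimensional residual space.
residual = [1.0, 2.0, 3.0]
activations = [0.5, 2.0]
decoder = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Clamp feature 1 (hypothetically the "code error" feature) to a large
# negative value, here -2x its observed activation.
steered = clamp_feature(residual, activations, decoder, index=1, value=-4.0)
```

Only the component along the clamped feature's direction changes, so every other feature's contribution to the residual is preserved.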

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
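
The neighbor list can be thought of as a thresholded cosine-similarity search over entity embeddings. The sketch below is one hedged way to express that, assuming each entity has an embedding vector. The names `cosine_similarity` and `related_candidates` and the toy vectors are hypothetical; only the 0.65 cutoff comes from this page.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def related_candidates(
    query: List[float],
    others: Dict[str, List[float]],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities whose similarity to `query` meets the threshold, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in others.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (hypothetical): "a" is identical to the query,
# "c" is at 45 degrees, and "b" is orthogonal.
neighbors = related_candidates(
    [1.0, 0.0],
    {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]},
)
```

In this toy case "b" falls below the cutoff and is dropped, while "a" and "c" are returned in descending order of similarity.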

Clamping code error feature to high activation causes the model to hallucinate error messages on bug-free code. · finding · 0.875
Causal effect: feature induces perception of bugs.
Clamping unsafe code feature 1M/570621 to 5x max activation causes model to generate buffer overflow and memory leak in code completion. · finding · 0.804
Causal effect: activates generation of security bugs.
Some failures may reflect prompt design rather than model limitations, though code agents avoid errors without prompts · claim · 0.797
noted as a possible confound
Clamping dialogue/assistant feature 1M/80091 to negative 2x max activation causes model to drop assistant persona and respond human-like. · finding · 0.780
Feature manipulation alters persona.
Clamping addition feature active on non-addition code tricks the model into believing it has been asked to execute addition. · finding · 0.780
Causal effect showing the feature governs computation.
Feature 1M/1013764 activates on diverse code errors (typos in code, array overflow, divide by zero, type mismatch) across Python, C, and Scheme, but not on English prose typos. · finding · 0.745
Shows a general code error detector beyond simple typo detection.
Code agents operate on structured data with exact arithmetic, while LLMs must parse natural-language observations and track state across turns; some failures may partly reflect numerical parsing or working-memory limitations · claim · 0.743
discussion of potential confounds
Some failures may reflect prompt design rather than model limitations, but the underlying issue is one of reasoning rather than instruction-following. · claim · 0.743
Acknowledges the confound of not explicitly instructing models to track wealth, yet points to reasoning gaps given code agents avoid errors without prompts.