finding
active
finding:clamping-code-error-feature-to-large-negative-activation-causes-model-to-output-correct-result-despite-bug-in-code-and-in-one-case-rewrite-code-without-bug
Clamping the code error feature to a large negative activation causes the model to output the correct result despite a bug in the code and, in one case, to rewrite the code without the bug.
Suppressing the feature makes the model ignore bugs.
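The intervention described here can be illustrated with a minimal sketch. It assumes the feature comes from a ReLU sparse autoencoder over a single activation vector and that the autoencoder's reconstruction error is preserved during the edit. Neither detail is stated in this entry, and `clamp_feature`, `W_enc`, `b_enc`, and `W_dec` are hypothetical names, not the source paper's code.

```python
def clamp_feature(resid, W_enc, b_enc, W_dec, feature_idx, clamp_value):
    """Return `resid` with one dictionary feature clamped to `clamp_value`.

    Assumed (hypothetical) shapes:
      resid: list of d floats (one activation vector)
      W_enc: d x n_features encoder weights, b_enc: n_features biases
      W_dec: n_features x d decoder weights
    """
    d = len(resid)
    n_features = len(b_enc)

    # Encode: ReLU(resid @ W_enc + b_enc). ReLU is an assumption.
    acts = [
        max(0.0, sum(resid[i] * W_enc[i][j] for i in range(d)) + b_enc[j])
        for j in range(n_features)
    ]

    # Decoding the clamped features and adding back the reconstruction
    # error is algebraically the same as shifting the original vector
    # along the feature's decoder direction by the change in activation.
    delta = clamp_value - acts[feature_idx]
    return [resid[i] + delta * W_dec[feature_idx][i] for i in range(d)]


# Toy usage: identity encoder/decoder, clamp feature 1 to a large negative value.
identity = [[1.0, 0.0], [0.0, 1.0]]
patched = clamp_feature([1.0, 2.0], identity, [0.0, 0.0], identity, 1, -10.0)
```

In a real model the patched vector would replace the original activation at the chosen layer and token positions during the forward pass; that hook is model-specific and is omitted here.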
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Causal effect: feature induces perception of bugs.
- Causal effect: activates generation of security bugs.
- Noted as a possible confound.
- Feature manipulation alters persona.
- Causal effect showing the feature governs computation.
- Shows a general code error detector beyond simple typo detection.
- Discussion of potential confounds.
- Acknowledges the confound of not explicitly instructing models to track wealth, yet points to reasoning gaps, given that code agents avoid errors without being prompted.