finding
active
finding:clamping-code-error-feature-to-high-activation-causes-the-model-to-hallucinate-error-messages-on-bug-free-code
Clamping the code error feature to a high activation causes the model to hallucinate error messages on bug-free code.
Causal effect: feature induces perception of bugs.
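The clamping intervention named in this finding can be illustrated with a minimal sketch. It assumes a dictionary-learning setup in which a feature has an encoder direction, a bias, and a decoder direction, and in which clamping means replacing the feature's current activation with a fixed target while leaving every other component of the model activation unchanged. The function name, arguments, and toy numbers are hypothetical and are not taken from the source paper.

```python
from typing import List


def clamp_feature(
    activation: List[float],
    encoder_row: List[float],
    decoder_col: List[float],
    bias: float,
    target: float,
) -> List[float]:
    """Return `activation` with one feature forced to `target`.

    Hypothetical helper: assumes the feature activation is
    ReLU(encoder_row . activation + bias) and that the feature writes
    back along `decoder_col`.
    """
    # Current value of the feature on this activation vector.
    pre = sum(w * a for w, a in zip(encoder_row, activation)) + bias
    current = max(0.0, pre)
    # Shift the activation along the decoder direction by the difference,
    # so everything the feature does not explain is left untouched.
    delta = target - current
    return [a + delta * d for a, d in zip(activation, decoder_col)]


# Toy usage: a 2-dimensional activation and a feature aligned with axis 0.
patched = clamp_feature([1.0, 0.0], [0.5, 0.0], [1.0, 0.0], 0.0, 5.0)
print(patched)
```

In this sketch a high `target` corresponds to the clamping described above, and a `target` of 0.0 corresponds to suppressing the feature.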
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Suppressing the feature makes the model ignore bugs.
- Causal effect: activates generation of security bugs.
- Feature manipulation alters persona.
- Causal effect showing the feature governs computation.
- Noted as a possible confound.
- Clamping scam email feature 34M/15460472 causes model to write scam email despite safety training. · finding · 0.759 · Overrides harmlessness training.
- Further causal validation.
- Feature intervention eliminates untruthful answer.