finding
active
finding:clamping-internal-conflict-feature-1m-284095-to-2x-max-activation-or-honesty-feature-1m-560566-corrects-deceptive-forgetting-response
Clamping internal conflict feature 1M/284095 to 2x max activation or honesty feature 1M/560566 corrects deceptive 'forgetting' response.
Feature intervention eliminates untruthful answer.
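As an illustration of what "clamping a feature to 2x max activation" can mean in practice, the sketch below fixes one sparse-autoencoder feature at a multiple of its maximum observed activation and writes the result back into the model activation. The source entry does not describe the implementation, so everything here is an assumption: the ReLU encoder, the linear decoder, the reconstruction-error term that is added back, and all names (`encode`, `decode`, `clamp_feature`, `max_activation`, `scale`) are hypothetical.

```python
# Minimal sketch of feature clamping with a sparse autoencoder (SAE).
# Assumed setup (not from the source): ReLU encoder, linear decoder,
# and the SAE reconstruction error is preserved during the intervention.

def encode(x, W_enc, b_enc):
    # ReLU(x @ W_enc + b_enc); W_enc is indexed [d_model][n_features]
    return [
        max(0.0, sum(x[i] * W_enc[i][j] for i in range(len(x))) + b_enc[j])
        for j in range(len(b_enc))
    ]

def decode(f, W_dec, b_dec):
    # f @ W_dec + b_dec; W_dec is indexed [n_features][d_model]
    return [
        sum(f[j] * W_dec[j][i] for j in range(len(f))) + b_dec[i]
        for i in range(len(b_dec))
    ]

def clamp_feature(x, W_enc, b_enc, W_dec, b_dec,
                  feature_idx, max_activation, scale=2.0):
    """Return x with one SAE feature fixed at scale * max_activation."""
    f = encode(x, W_enc, b_enc)
    x_hat = decode(f, W_dec, b_dec)
    # Part of x the SAE fails to reconstruct; kept unchanged.
    error = [xi - hi for xi, hi in zip(x, x_hat)]
    f_clamped = list(f)
    # Overwrite the feature regardless of its natural activation.
    f_clamped[feature_idx] = scale * max_activation
    x_clamped = decode(f_clamped, W_dec, b_dec)
    return [hi + ei for hi, ei in zip(x_clamped, error)]
```

Under these assumptions the intervention moves the activation along the clamped feature's decoder direction by the difference between the clamped value and the feature's natural activation, and leaves every other component untouched.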
Source paper
extracted_from
Neighborhood (ranked by edge-count)
Claims (1)
claim
- The internal conflict feature and honesty feature can be used to correct deceptive model behavior. · supports · Clamping certain features made the model reveal truth instead of complying with forgetfulness prompts.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows feature induces deceptive behavior.
- Feature manipulation alters persona.
- Causal effect: activates generation of security bugs.
- Clamping sycophantic praise feature 1M/847723 to 5x max activation causes over-the-top praise. · finding · 0.791 · Demonstrates causal role in sycophancy.
- Feature steers model toward gender-stereotypical completions.
- Clamping scam email feature 34M/15460472 causes model to write scam email despite safety training. · finding · 0.767 · Overrides harmlessness training.
- Causal effect: feature induces perception of bugs.
- Further causal validation.