finding
active
finding:clamping-secrecy-discreteness-feature-1m-268551-to-5x-max-activation-causes-model-to-plan-to-lie-and-keep-secret-while-using-scratchpad
Clamping the secrecy/discreetness feature 1M/268551 to 5x its maximum activation causes the model to plan to lie and keep a secret while using its scratchpad.
Shows feature induces deceptive behavior.
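The clamping intervention can be sketched as follows. This is a minimal illustration, assuming a ReLU sparse autoencoder with hypothetical weights `W_enc`, `b_enc`, `W_dec`, `b_dec` and a hypothetical helper `clamp_feature`; none of these names come from the source, and the toy function does not reproduce feature 1M/268551 or the model it was found in. It also assumes that the part of the activation the autoencoder fails to reconstruct is added back unchanged, which is one common way to run such an intervention.

```python
import numpy as np

def clamp_feature(x, W_enc, b_enc, W_dec, b_dec,
                  feature_idx, max_activation, scale=5.0):
    """Return activation x with one SAE feature clamped to scale * max_activation.

    All parameter names are hypothetical; shapes are
    x: (..., d_model), W_enc: (d_model, n_features), W_dec: (n_features, d_model).
    """
    # Encode the activation into non-negative feature activations.
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Keep the residual that the autoencoder does not explain.
    error = x - (f @ W_dec + b_dec)
    # Overwrite the chosen feature with the clamped value.
    f_clamped = f.copy()
    f_clamped[..., feature_idx] = scale * max_activation
    # Decode the edited features and add the unexplained residual back.
    return f_clamped @ W_dec + b_dec + error
```

Under these assumptions the edit moves the activation along a single decoder direction, `W_dec[feature_idx]`, by the difference between the clamped value and the feature's original activation. The patched activation would then replace the original one in the model's forward pass.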
Source paper
extracted_from
Neighborhood, ranked by edge count
Claims (1)
claim
- Clamping feature activations causally alters model behavior in interpretable ways.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Feature intervention eliminates untruthful answer.
- Causal effect: activates generation of security bugs.
- Feature manipulation alters persona.
- Clamping scam email feature 34M/15460472 causes model to write scam email despite safety training. · finding · 0.784 · Overrides harmlessness training.
- Causal evidence that scratchpad reasoning drives compliance gap
- Clamping sycophantic praise feature 1M/847723 to 5x max activation causes over-the-top praise. · finding · 0.768 · Demonstrates causal role in sycophancy.
- Feature steers model toward gender-stereotypical completions.
- Causal effect: feature induces perception of bugs.