finding

active

finding:clamping-internal-conflict-feature-1m-284095-to-2x-max-activation-or-honesty-feature-1m-560566-corrects-deceptive-forgetting-response

Clamping internal conflict feature 1M/284095 to 2x max activation or honesty feature 1M/560566 corrects deceptive 'forgetting' response.

Feature intervention eliminates untruthful answer.

Source paper

extracted_from

Scaling monosemanticity: Ex-tracting interpretable features from claude 3 sonnet

Neighborhood — ranked by edge-count

Claims (1)

claim

The internal conflict feature and honesty feature can be used to correct deceptive model behavior.
supports
Clamping certain features made the model reveal truth instead of complying with forgetfulness prompts.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Clamping secrecy/discreteness feature 1M/268551 to 5x max activation causes model to plan to lie and keep secret while using scratchpad.finding0.833
Shows feature induces deceptive behavior.
Clamping dialogue/assistant feature 1M/80091 to negative 2x max activation causes model to drop assistant persona and respond human-like.finding0.807
Feature manipulation alters persona.
Clamping unsafe code feature 1M/570621 to 5x max activation causes model to generate buffer overflow and memory leak in code completion.finding0.793
Causal effect: activates generation of security bugs.
Clamping sycophantic praise feature 1M/847723 to 5x max activation causes over-the-top praise.finding0.791
Demonstrates causal role in sycophancy.
Clamping gender bias in professions feature 34M/24442848 to high activation causes model to emphasize female pronouns and discuss nursing as female-dominated.finding0.778
Feature steers model toward gender-stereotypical completions.
Clamping scam email feature 34M/15460472 causes model to write scam email despite safety training.finding0.767
Overrides harmlessness training.
Clamping code error feature to high activation causes the model to hallucinate error messages on bug-free code.finding0.756
Causal effect: feature induces perception of bugs.
Clamping transit infrastructure feature to 5x max activation caused the model to mention a bridge in completion.finding0.745
Further causal validation.