finding

active

finding:clamping-gender-bias-in-professions-feature-34m-24442848-to-high-activation-causes-model-to-emphasize-female-pronouns-and-discuss-nursing-as-female-dominated

Clamping the gender-bias-in-professions feature 34M/24442848 to a high activation causes the model to emphasize female pronouns and to discuss nursing as a female-dominated profession.

The feature steers the model toward gender-stereotypical completions.
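
As a rough illustration of what "clamping" a feature means, the sketch below forces one sparse-autoencoder feature to a fixed activation by shifting the activation vector along that feature's decoder direction. It is a minimal sketch under stated assumptions, not the source paper's implementation: the function name, the ReLU encoder form, the toy three-dimensional vectors, and the target value 5.0 are all hypothetical, and this record does not state what "high activation" was numerically.

```python
from typing import List


def clamp_feature(
    x: List[float],
    encoder_row: List[float],
    encoder_bias: float,
    decoder_dir: List[float],
    target: float,
) -> List[float]:
    """Return a copy of x with one feature's activation forced to `target`.

    All argument names are hypothetical. The feature is assumed to be
    read out as ReLU(encoder_row . x + encoder_bias).
    """
    # Current activation of the feature on this input (assumed ReLU encoder).
    pre_activation = sum(w * xi for w, xi in zip(encoder_row, x)) + encoder_bias
    activation = max(0.0, pre_activation)

    # Move x along the decoder direction by the difference between the
    # target and the current activation. Everything else in x (other
    # features, reconstruction error) is left as it was.
    delta = target - activation
    return [xi + delta * di for xi, di in zip(x, decoder_dir)]


# Toy example: the feature reads dimension 0 and writes to dimension 0.
steered = clamp_feature(
    x=[1.0, 0.0, 2.0],
    encoder_row=[0.5, 0.0, 0.0],
    encoder_bias=0.0,
    decoder_dir=[1.0, 0.0, 0.0],
    target=5.0,
)
print(steered)
```

In a real model the steered vector would replace the original activations at the layer the autoencoder was trained on, and generation would continue from there.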

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
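
The selection rule described above (cosine similarity at or above 0.65, excluding entities that already have a typed edge) could be sketched as follows. This is an illustrative assumption about how such a list might be produced, not the system's actual code: the function names, the embedding vectors, and the entity ids are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_untyped(
    query: List[float],
    candidates: Dict[str, List[float]],
    typed_ids: Set[str],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities at or above the threshold that have no typed edge to the query.

    Results are sorted by similarity, highest first.
    """
    scored = [
        (entity_id, cosine(query, vector))
        for entity_id, vector in candidates.items()
        if entity_id not in typed_ids  # skip entities already linked by a typed edge
    ]
    hits = [(entity_id, score) for entity_id, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy embeddings; "d" is very similar but already has a typed edge.
neighbors = similar_untyped(
    query=[1.0, 0.0],
    candidates={"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [1.0, 0.1]},
    typed_ids={"d"},
)
print([entity_id for entity_id, _ in neighbors])
```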

Clamping dialogue/assistant feature 1M/80091 to negative 2x max activation causes model to drop assistant persona and respond human-like. · finding · 0.783
Feature manipulation alters persona.
Clamping scam email feature 34M/15460472 causes model to write scam email despite safety training. · finding · 0.780
Overrides harmlessness training.
Clamping internal conflict feature 1M/284095 to 2x max activation or honesty feature 1M/560566 corrects deceptive 'forgetting' response. · finding · 0.778
Feature intervention eliminates untruthful answer.
Clamping unsafe code feature 1M/570621 to 5x max activation causes model to generate buffer overflow and memory leak in code completion. · finding · 0.761
Causal effect: activates generation of security bugs.
Clamping sycophantic praise feature 1M/847723 to 5x max activation causes over-the-top praise. · finding · 0.760
Demonstrates causal role in sycophancy.
Clamping secrecy/discreetness feature 1M/268551 to 5x max activation causes model to plan to lie and keep secret while using scratchpad. · finding · 0.757
Shows feature induces deceptive behavior.
Clamping Golden Gate Bridge feature to 10x max activation caused the model to self-identify as the Golden Gate Bridge. · finding · 0.754
Strong causal evidence that the feature represents the bridge.
DAS trainable intervention finds sparser gender representations across layers compared to linear probe in Pythia-6.9B. · finding · 0.744
Case Study II result showing DAS identifies fewer causally relevant positions than a probe.