claim

active

claim:features-can-be-used-to-steer-large-models

Features can be used to steer large models.

Clamping feature activations causally alters model behavior in interpretable ways.
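
As a rough illustration of what clamping means here, the sketch below encodes an activation vector with a toy sparse autoencoder (SAE), pins one feature to a fixed multiple of its maximum activation, and decodes again while preserving the SAE's reconstruction error. It assumes a plain ReLU SAE. The function name, the weight names, and the parameters `idx`, `max_act`, and `scale` are hypothetical and not taken from the source paper's code.

```python
import numpy as np

def clamp_feature(x, W_enc, b_enc, W_dec, b_dec, idx, max_act, scale):
    """Return x with toy-SAE feature `idx` clamped to scale * max_act."""
    # Encode: ReLU feature activations of the (assumed) SAE.
    acts = np.maximum(W_enc @ x + b_enc, 0.0)
    # Part of x the SAE does not reconstruct; carried through unchanged.
    error = x - (W_dec @ acts + b_dec)
    # Pin the chosen feature to a fixed multiple of its max activation.
    clamped = acts.copy()
    clamped[idx] = scale * max_act
    # Decode the edited features and add the reconstruction error back.
    return W_dec @ clamped + b_dec + error
```

Under these assumptions the result equals `x` shifted along the decoder direction of feature `idx`, so a positive `scale` (for example 10) pushes toward the feature and a negative `scale` (for example -2) pushes away from it.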

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Findings (3)

finding

Clamping the dialogue/assistant feature 1M/80091 to negative 2x its max activation causes the model to drop the assistant persona and respond like a human.
supports
Feature manipulation alters persona.
Clamping the Golden Gate Bridge feature to 10x its max activation causes the model to self-identify as the Golden Gate Bridge.
supports
Strong causal evidence that the feature represents the bridge.
Clamping the secrecy/discreetness feature 1M/268551 to 5x its max activation causes the model to plan to lie and keep a secret while using its scratchpad.
supports
Shows feature induces deceptive behavior.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Will these methods work for large models? · question · 0.783
Motivating question for the paper, addressed by scaling SAEs to Claude 3 Sonnet.
Our method enables bidirectional steering of model behavior. · finding · 0.776
The method can steer the model in both positive and negative directions on the target semantic.
Features may not be strictly one-dimensional objects; higher-dimensional feature manifolds may exist in model representations. · hypothesis · 0.768
Extension of the superposition hypothesis to account for continuous families of features.
Larger SAEs contain features for concepts not captured in smaller SAEs, indicating improved coverage. · claim · 0.767
Scaling SAE size increases granularity and discovers new features.
Base models are good modellers of worlds but not of their own state, because they lack a developed self-model initially. · claim · 0.763
Observation about asymmetry in base model capabilities.
Optimally steering model behavior requires isolating concept geometry and defining operators to navigate it. · claim · 0.762
Can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors? · question · 0.762
Question about practical safety application of feature monitoring.
Feature steering was effective in 5 out of 7 cases where few-shot probe steering vectors failed to produce meaningful behavior change. · finding · 0.759
Empirical comparison showing advantage of SAE features in low-data regime.