claim
active
claim:features-can-be-used-to-steer-large-models
Features can be used to steer large models.
Clamping feature activations causally alters model behavior in interpretable ways.
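As an illustration of what clamping a feature activation can mean in practice, here is a minimal sketch assuming a sparse autoencoder (SAE) with a ReLU encoder and a linear decoder. All names here (`encode`, `clamp_feature`, `W_enc`, `b_enc`, `W_dec`) are hypothetical and the weights are random toy values; this is not necessarily the exact procedure used in the source paper. The idea is to shift the model activation along one feature's decoder direction until that feature's contribution matches a chosen target value, leaving everything else untouched.

```python
import numpy as np


def encode(x, W_enc, b_enc):
    """ReLU encoder of a toy SAE: model activations -> feature activations."""
    return np.maximum(0.0, x @ W_enc + b_enc)


def clamp_feature(x, W_enc, b_enc, W_dec, feature_idx, value):
    """Return x with one feature's decoder contribution set to `value`.

    Equivalent to decoding with the feature clamped and adding back the
    SAE reconstruction error, so only the chosen direction changes.
    """
    acts = encode(x, W_enc, b_enc)
    delta = value - acts[..., feature_idx]           # how far to move
    return x + delta[..., None] * W_dec[feature_idx]  # move along decoder row


# Toy setup (assumed shapes: d_model=8, n_features=32).
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_enc = rng.normal(size=(d_model, n_features))
b_enc = rng.normal(size=n_features)
W_dec = rng.normal(size=(n_features, d_model))
x = rng.normal(size=d_model)

idx = 5
current = encode(x, W_enc, b_enc)[idx]
x_up = clamp_feature(x, W_enc, b_enc, W_dec, idx, current + 10.0)  # amplify
x_off = clamp_feature(x, W_enc, b_enc, W_dec, idx, 0.0)            # ablate
```

In a real model the steered activation would be written back into the forward pass (for example, into the residual stream at the layer the SAE was trained on) so that downstream computation sees the change. Clamping above the current value pushes behavior toward the feature's concept, while clamping to zero removes its contribution.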
Source paper
extracted_from
Neighborhood (ranked by edge-count)
Findings (3)
finding
- Feature manipulation alters persona.
- Strong causal evidence that the feature represents the bridge.
- Shows feature induces deceptive behavior.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Motivating question for the paper, addressed by scaling SAEs to Claude 3 Sonnet.
- The method can steer the model in both positive and negative directions on the target semantic.
- Features may not be strictly one-dimensional objects; higher-dimensional feature manifolds may exist in model representations. (hypothesis, 0.768) Extension of the superposition hypothesis to account for continuous families of features.
- Scaling SAE size increases granularity and discovers new features.
- Observation about asymmetry in base model capabilities.
- Can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors? (question, 0.762) Question about a practical safety application of feature monitoring.
- Empirical comparison showing advantage of SAE features in low-data regime.