finding
active
finding:feature-steering-was-effective-in-5-out-of-7-cases-where-few-shot-probe-steering-vectors-failed-to-produce-meaningful-behavior-change
Feature steering was effective in 5 out of 7 cases where few-shot probe steering vectors failed to produce meaningful behavior change.
Empirical comparison showing the advantage of SAE features over few-shot probe vectors in the low-data regime.
Source paper
extracted_from
Neighborhood (ranked by edge count)
Claims (1)
claim
- SAE features can be found without pre-specified concepts, and feature steering often outperforms few-shot probe vectors.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics.
- Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
- Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
- Mechanistic interpretation of how activation steering induces deception through the model's reasoning process.
- Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
- Observation that 100% accuracy on specific concept-layer-strength combinations suggests concept-specific detectability.
- Constructing steering vectors from the difference of mean activations on positive and negative examples, for comparison.
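
The difference-of-means construction named in the last item can be sketched as follows. This is a minimal illustration, not the source paper's implementation: the function names (`diff_of_means_vector`, `steer`), the unit normalization, the steering strength, and the synthetic activations are all assumptions made for the example. In practice the activations would be read from a chosen layer of the model.

```python
import numpy as np

def diff_of_means_vector(pos_acts, neg_acts, normalize=True):
    """Steering vector = mean(positive activations) - mean(negative activations).

    pos_acts, neg_acts: arrays of shape (n_examples, d_model).
    Unit-normalizing is an assumption here, made so that `strength`
    below has a consistent scale.
    """
    pos = np.asarray(pos_acts, dtype=float)
    neg = np.asarray(neg_acts, dtype=float)
    v = pos.mean(axis=0) - neg.mean(axis=0)
    if normalize:
        norm = np.linalg.norm(v)
        if norm > 0:
            v = v / norm
    return v

def steer(hidden, vector, strength=4.0):
    """Add the scaled steering vector to every hidden state (broadcast over rows)."""
    return hidden + strength * vector

# Synthetic stand-in for layer activations (hypothetical data, few-shot sized).
rng = np.random.default_rng(0)
d_model = 16
concept = np.zeros(d_model)
concept[3] = 1.0                                   # pretend the concept lives on one axis
pos = rng.normal(size=(8, d_model)) + 2.0 * concept  # examples exhibiting the concept
neg = rng.normal(size=(8, d_model))                  # examples without it

v = diff_of_means_vector(pos, neg)
h = rng.normal(size=(5, d_model))                  # hidden states to be steered
h_steered = steer(h, v, strength=4.0)
```

With only a handful of examples per class, the mean difference also carries sampling noise, which is one plausible reason few-shot probe vectors can fail to change behavior; the sketch does not model that effect beyond using small sample sizes.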