finding

active

finding:feature-steering-was-effective-in-5-out-of-7-cases-where-few-shot-probe-steering-vectors-failed-to-produce-meaningful-behavior-change

Feature steering was effective in 5 out of 7 cases where few-shot probe steering vectors failed to produce meaningful behavior change.

Empirical comparison showing an advantage of SAE features over few-shot probe vectors in the low-data regime.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Claims (1)

claim

Dictionary learning offers advantages over linear probes: amortization of cost and unsupervised discovery of abstractions.
supports
SAE features can be found without pre-specified concepts, and feature steering often outperforms few-shot probe vectors.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.801
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable. · quote · 0.801
Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity. · claim · 0.798
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector · finding · 0.787
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
Under steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas triggering repetitive reasoning cycles · claim · 0.785
Mechanistic interpretation of how activation steering induces deception through the model's reasoning process
Activation steering effectively biases latent representations but does not fully replicate the mechanisms triggered by explicit instruction. · claim · 0.784
Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors · claim · 0.783
Observation from 100% accuracy on specific concept-layer-strength combinations suggesting concept-specific detectability
Few-shot linear probe steering baseline · method · 0.780
Constructing steering vectors from the difference of mean activations on positive and negative examples, for comparison.
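
The difference-of-means construction named in the last entry can be sketched in a few lines. This is a minimal illustration, not the source paper's implementation: the function names (`mean_vector`, `diff_of_means_vector`, `steer`), the use of plain Python lists for activations, and the additive intervention with a scalar `strength` are all assumptions made for the example.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty collection of equal-length vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def diff_of_means_vector(
    pos_acts: Sequence[Sequence[float]],
    neg_acts: Sequence[Sequence[float]],
) -> Vector:
    """Steering direction: mean(positive activations) - mean(negative activations)."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def steer(activation: Sequence[float], vector: Sequence[float], strength: float) -> Vector:
    """Hypothetical intervention: add the scaled steering vector to one activation."""
    return [a + strength * v for a, v in zip(activation, vector)]


# Toy few-shot setting: two positive and two negative examples, 2-dimensional activations.
pos = [[1.0, 0.0], [3.0, 2.0]]
neg = [[0.0, 0.0], [0.0, 2.0]]
direction = diff_of_means_vector(pos, neg)
steered = steer([0.5, 1.0], direction, strength=0.5)
```

With only a handful of examples per class, the two means are noisy estimates, which is one plausible reading of why this baseline can fail in the low-data regime described above.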