concept
active
concept:steering-language-models-with-activation-engineering-turner-et-al-2023
Steering Language Models With Activation Engineering (Turner et al., 2023)
Foundational paper introducing the activation steering methodology used in this work.
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Central claim of the paper; supported by the model organism ground-truth approach.
- Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
- Modifying model behavior by clamping SAE feature activations to specific values during forward pass.
- Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
- Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4 · finding · 0.795 · Replicates main result on simpler model; qualitatively similar patterns.
- Scaling Laws for Activation Steering with Llama 2 Models and Refusal Mechanisms (Ali et al., 2025) · concept · 0.786 · Related work finding larger models more resistant to steering, potentially consistent with ESR in 70B.
- Addresses skeptical alternative that reports reflect only conversational content.
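
Two of the neighboring entries describe interventions on activations: adding a steering vector scaled by a strength, and clamping an SAE feature to a fixed value. As a rough illustration only, the sketch below shows both on plain Python lists. The function names, the bias-free linear decoder, and the toy vectors are assumptions made for this example and are not taken from Turner et al. (2023) or any of the listed entries; a real setup would apply these operations to a model's hidden states during the forward pass.

```python
from typing import List

Vector = List[float]


def add_steering_vector(activation: Vector, steering: Vector, strength: float) -> Vector:
    """Return activation + strength * steering, element-wise (hypothetical helper)."""
    if len(activation) != len(steering):
        raise ValueError("activation and steering vector must have the same length")
    return [a + strength * s for a, s in zip(activation, steering)]


def clamp_feature(features: Vector, index: int, value: float) -> Vector:
    """Return a copy of the SAE feature activations with one feature fixed to `value`."""
    clamped = list(features)
    clamped[index] = value
    return clamped


def decode(features: Vector, decoder: List[Vector]) -> Vector:
    """Assumed bias-free linear SAE decoder: sum over i of features[i] * decoder[i]."""
    dim = len(decoder[0])
    return [sum(f * row[j] for f, row in zip(features, decoder)) for j in range(dim)]


if __name__ == "__main__":
    # Toy 2-dimensional activation and steering direction (illustrative values).
    steered = add_steering_vector([1.0, 2.0], [1.0, -1.0], strength=0.5)
    print(steered)

    # Clamp feature 1 of a toy 3-feature SAE code, then map it back to
    # activation space with a toy decoder (one row per feature).
    code = clamp_feature([0.0, 3.0, 1.0], index=1, value=10.0)
    print(decode(code, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

The strength argument plays the role of the steering strength mentioned in the entries above; the sketch does not attempt to reproduce any particular value or result.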