Steering Language Models With Activation Engineering (Turner et al., 2023)

Foundational paper introducing the activation steering methodology used in this work.

Neighborhood — ranked by edge-count

Papers (1)

Endogenous Resistance to Activation Steering in Language Models · paper · cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness · claim · 0.828
Central claim of the paper; supported by the model organism ground-truth approach.
Activation steering effectively biases latent representations but does not fully replicate the mechanisms triggered by explicit instruction. · claim · 0.807
Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
Feature steering (clamping feature activations) · method · 0.801
Modifying model behavior by clamping SAE feature activations to specific values during the forward pass.
Optimally steering model behavior requires isolating concept geometry and defining operators to navigate it. · claim · 0.801
Activation Steering · method · 0.800
Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4 · finding · 0.795
Replicates main result on simpler model; qualitatively similar patterns.
Scaling Laws for Activation Steering with Llama 2 Models and Refusal Mechanisms (Ali et al., 2025) · concept · 0.786
Related work finding larger models more resistant to steering, potentially consistent with ESR in 70B.
Models are not merely tracking dialogue context features; same-concept steering shows privileged internal access is necessary to explain self-report shifts · claim · 0.785
Addresses skeptical alternative that reports reflect only conversational content.
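
Several entries above describe steering either as adding a difference vector to activations or as clamping a feature to a fixed value. The following is a minimal sketch of both operations on plain Python lists, not the implementation from any of the listed papers. The function names (`steering_vector`, `steer`, `clamp_feature`), the toy activation values, and the strength of 0.5 are all assumptions made for illustration; a real setup would apply these operations to a model's hidden states at a chosen layer during the forward pass.

```python
from typing import List

Vector = List[float]


def steering_vector(act_a: Vector, act_b: Vector) -> Vector:
    """Difference of two activation vectors, e.g. from a contrast pair."""
    return [a - b for a, b in zip(act_a, act_b)]


def steer(hidden: Vector, direction: Vector, strength: float) -> Vector:
    """Add a scaled steering vector to a hidden state."""
    return [h + strength * d for h, d in zip(hidden, direction)]


def clamp_feature(features: Vector, index: int, value: float) -> Vector:
    """Return a copy of feature activations with one feature pinned to a value."""
    clamped = list(features)
    clamped[index] = value
    return clamped


# Toy values (hypothetical; real activations come from a model's forward pass).
act_pos = [1.0, 0.0, 2.0]
act_neg = [0.0, 0.0, 1.0]

direction = steering_vector(act_pos, act_neg)
steered = steer([0.5, 0.5, 0.5], direction, strength=0.5)
clamped = clamp_feature([0.0, 3.0, 0.0], index=1, value=10.0)

print(direction)  # [1.0, 0.0, 1.0]
print(steered)    # [1.0, 0.5, 1.0]
print(clamped)    # [0.0, 10.0, 0.0]
```

In this sketch the additive form shifts every coordinate along the chosen direction, while clamping overwrites a single feature and leaves the rest untouched, which is one way to read the distinction the entries draw between the two methods.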