concept
active
concept:scaling-laws-for-activation-steering-with-llama-2-models-and-refusal-mechanisms-ali-et-al-2025
Scaling Laws for Activation Steering with Llama 2 Models and Refusal Mechanisms (Ali et al., 2025)
Related work finding that larger models are more resistant to steering, potentially consistent with ESR in the 70B model
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Foundational paper introducing activation steering methodology used in this work
- Illustrative finding that ESR mitigates but does not fully eliminate steering influence
- Core empirical result demonstrating that manifold steering produces on-target, behavior-aligned outputs.
- Empirical result demonstrating the failure mode of linear steering when concept geometry is cyclic.
- Key empirical result showing that optimizing for behavioral outputs and fitting representation geometry produce the same path in activation space.
- Model-specific difference in persona susceptibility
- Cited hypothesis from Lin et al. 2022 suggesting larger models become more capable of deception
- Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.