concept

active

concept:scaling-laws-for-activation-steering-with-llama-2-models-and-refusal-mechanisms-ali-et-al-2025

Scaling Laws for Activation Steering with Llama 2 Models and Refusal Mechanisms (Ali et al., 2025)

Related work finding that larger models are more resistant to steering, potentially consistent with the ESR observed in the 70B model

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering Language Models With Activation Engineering (Turner et al., 2023) · concept · 0.786
Foundational paper introducing activation steering methodology used in this work
Llama-3.3-70B corrected response scores 75/100 rather than 100 due to residual steering effects (Snell's law reference) · finding · 0.784
Illustrative finding that ESR mitigates but does not fully eliminate steering influence
Steering Llama-3.1 8B along the circular representation manifold produces outputs that follow the natural circle of the behavior manifold, cleanly shifting probability mass from Monday through successive days. · finding · 0.779
Core empirical result demonstrating that manifold steering produces on-target, behavior-aligned outputs.
Linear steering on Llama-3.1 8B for the days-of-week task cuts across the behavior manifold, producing noisy off-target effects where predicted tokens are not even days of the week. · finding · 0.777
Empirical result demonstrating the failure mode of linear steering when concept geometry is cyclic.
The representation-based path and the behavior-based path in Llama-3.1 8B activation space trace out similar curves, demonstrating bidirectional geometry alignment. · finding · 0.776
Key empirical result showing that optimizing for behavioral outputs and fitting representation geometry produce the same path in activation space.
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with an even split between human and nonhuman portrayals · finding · 0.772
Model-specific difference in persona susceptibility
Deceptive capabilities may scale with model size (inverse scaling law hypothesis) · hypothesis · 0.772
Cited hypothesis from Lin et al. 2022 suggesting larger models become more capable of deception
Activation steering effectively biases latent representations but does not fully replicate the mechanisms triggered by explicit instruction. · claim · 0.771
Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
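
The "related by similarity" listing above can be read as a simple filter: keep entities whose embedding has cosine similarity of at least 0.65 with this one and that share no typed edge with it. The sketch below illustrates that reading under stated assumptions. The function names, the toy two-dimensional embeddings, and the edge representation are all hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold, excluding
    entities that already share a typed edge with the target.

    embeddings:  dict mapping entity id -> list of floats (hypothetical layout)
    typed_edges: set of (id_a, id_b) pairs, treated as undirected here
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip anything already linked by a typed relation such as "cites".
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, round(score, 3)))
    # Highest similarity first, matching the ordering of the listing.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: ids and vectors are illustrative only.
embeddings = {
    "scaling_laws": [1.0, 0.0],
    "activation_engineering": [0.8, 0.6],
    "esr_paper": [1.0, 0.1],
    "unrelated": [0.0, 1.0],
}
typed_edges = {("esr_paper", "scaling_laws")}  # the existing "cites" edge

candidates = similar_without_edge("scaling_laws", embeddings, typed_edges)
```

Here `esr_paper` is excluded despite its high similarity because it already has a typed edge, and `unrelated` falls below the threshold, so only `activation_engineering` remains as a candidate for a new edge or a duplicate check.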