finding

active

finding:ali-et-al-2025-found-contrastive-activation-addition-less-effective-at-larger-model-scale-consistent-with-esr-in-70b

Ali et al. 2025 found contrastive activation addition less effective at larger model scale, consistent with ESR in 70B

Prior finding from related work that aligns with ESR being strongest in the largest model tested

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Claims (1)

claim

Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference
supports
Central interpretive claim of the paper supported by causal ablation and activation evidence

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Our method achieves superior performance compared to Contrastive Activation Addition. · finding · 0.821
Performance gains over CAA in steering tasks.
Random latent ablation produces slight increase in ESR rate (3.8% to 4.2%), not statistically significant · finding · 0.786
Control result confirming OTD ablation effect is specific to those latents, not a general ablation artifact
ESR exhibits non-monotonic relationship with boost level, peaking around -0.3σ below threshold in Llama-3.3-70B · finding · 0.773
Characterizes the narrow operating window in which ESR can manifest
The 26 differentially-activated OTD latents play a causally important role in enabling ESR in Llama-3.3-70B · claim · 0.764
Causal interpretation of the ablation experiment results
The 25% reduction in multi-attempt rate from OTD ablation suggests additional mechanisms contribute to ESR beyond the identified latents · claim · 0.761
Acknowledges incompleteness of the causal account, suggesting redundant circuits or nonlinear interactions
Stress-sharing increases radius of cellular influence: ~30 units avg (step 1) vs ~5 units (non-sharing); lasts 85 steps vs 10 steps. · finding · 0.758
Features in A/1 have median activation correlation of 0.72 with most similar feature in B/1; neurons have median 0.46 · finding · 0.758
Systematic comparison showing features are substantially more universal than neurons across models
Contrastive Activation Addition (CAA) · method · 0.758
An existing activation steering method used as comparative baseline.
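
The similarity neighborhood listed above (cosine ≥ 0.65, no typed edge) can be read as a simple filter over entity embeddings. The sketch below is illustrative only: the function names, the toy two-dimensional embeddings, and the edge representation are assumptions, not the actual pipeline behind this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have no
    typed edge to the target, sorted by descending score.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical layout)
    typed_edges: set of (id_a, id_b) pairs, treated as undirected (assumption)
    """
    target = embeddings[target_id]
    # Collect every entity already connected to the target by a typed edge.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}

    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id or other_id in linked:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: ids and vectors are made up for illustration.
embeddings = {
    "finding_a": [1.0, 0.0],
    "claim_b": [1.0, 0.1],   # very similar, but already linked by a typed edge
    "method_c": [0.8, 0.6],  # similar enough and unlinked
    "finding_d": [0.0, 1.0], # orthogonal, below the threshold
}
typed_edges = {("finding_a", "claim_b")}
neighbors = similarity_neighbors("finding_a", embeddings, typed_edges)
```

With this toy data only `method_c` survives both conditions, which mirrors how an entry lands in this list: similar enough to the current entity, yet not connected to it by a typed relation such as `supports`.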