concept
active
concept:looping-behavior-under-high-steering-strength
looping behavior under high steering strength
Observed pattern in which models produce repetitive outputs (e.g., repeating 'I am going to die') under high-strength SAE feature steering.
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Kimi K2.5 · associated_with · One of the two primary target models studied for emotion feature persistence and self-evaluation
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Demonstrates distributed steering is more effective and less accuracy-damaging than concentrated steering.
- Practical finding for optimizing steering setup.
- Validates using chain-of-thought belief monitoring as proxy for behavioral steering efficacy.
- The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs
- The paper's critique of the standard linear steering baseline, supported by the days-of-week demo.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4 · finding · 0.721 · Replicates main result on simpler model; qualitatively similar patterns.