looping behavior under high steering strength

An observed pattern in which models produce repetitive outputs (e.g., repeating 'I am going to die') under high-strength sparse autoencoder (SAE) feature steering.

Neighborhood — ranked by edge-count

Concepts (1)

concept

Kimi K2.5
associated_with
One of the two primary target models studied for emotion feature persistence and self-evaluation

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering at 6 layers (strength 0.6 each, total 3.6) outperforms single-layer steering at equivalent total strength for type hint suppression · finding · 0.757
Demonstrates distributed steering is more effective and less accuracy-damaging than concentrated steering.
Distributing steering strength across multiple layers (6 layers at 0.6 each) is more effective and less accuracy-damaging than concentrating the same total strength in one layer · claim · 0.738
Practical finding for optimizing steering setup.
Steering's effect on verbalized evaluation/deployment beliefs in chain-of-thought is highly correlated with its effect on type hint rate across hyperparameter configurations · finding · 0.735
Validates using chain-of-thought belief monitoring as proxy for behavioral steering efficacy.
Endogenous Steering Resistance · concept · 0.729
The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs
Linear steering is often mismatched with a model's internal representation geometry, producing noisy, off-target effects. · claim · 0.728
The paper's critique of the standard linear steering baseline, supported by the days-of-week demo.
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with previously reported reflection datasets · finding · 0.725
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
How Do Homeostatic Loops Scale And Pivot Across · question · 0.722
Activation steering works on SDF-only model organism (before expert iteration) with steering strength 0.4 · finding · 0.721
Replicates main result on simpler model; qualitatively similar patterns.
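
The "Related by similarity" listing above keeps entities whose cosine similarity to this one is at least 0.65 and that have no typed edge to it. A minimal sketch of that selection rule follows, assuming each entity has an embedding vector and that typed edges are stored as a mapping from an entity to its linked entities. The function names, the data layout, and the toy vectors are hypothetical illustrations, not the graph's actual implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """List (name, score) pairs at or above the threshold with no typed edge to target.

    embeddings:  dict mapping entity name -> embedding vector (assumed layout)
    typed_edges: dict mapping entity name -> set of names it has typed edges to
    """
    linked = typed_edges.get(target, set())
    candidates = []
    for name, vector in embeddings.items():
        # Skip the entity itself and anything already connected by a typed edge.
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            candidates.append((name, round(score, 3)))
    # Highest similarity first, matching the order of the listing.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data: "d" is identical to "a" but already has a typed edge, so it is excluded.
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.0]}
toy_edges = {"a": {"d"}}
related = similar_without_edge("a", toy_embeddings, toy_edges)
```

With the toy data, only "b" is returned: "c" falls below the threshold and "d" is filtered out by its existing typed edge. Entities surfaced this way are candidates for new edges or for duplicate review, as the section description notes.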