claim
active
claim:some-sae-concept-steering-interventions-act-as-wrecking-balls-that-collapse-global-model-performance-rather-than-selectively-modifying-target-concepts
Some SAE concept steering interventions act as 'wrecking balls' that collapse global model performance rather than selectively modifying target concepts.
A critical failure mode identified in the paper, demonstrating the risk of naïve concept steering.
Source paper
extracted_from · (2026) · William Lehn-Schiøler · Magnus Ruud Kjær · Rahul Thapa · M. Pedersen +9
Neighborhood — ranked by edge-count
Findings (2)
finding
- Observation of a catastrophic performance drop when steering certain concepts.
- Demonstrates a critical failure mode of concept steering with clinical safety implications.
Communities (3)
community
- Explores geometry of activation/behavior manifolds to enable selective, non-destructive concept interventions.
- Concepts encoded as curved manifolds and circular structures in LLM activation spaces.
- Studies how targeted interventions on learned concepts can cause sudden, global collapse in neural network performance.
Concepts (1)
concept
- Representational Failure · supports · A failure mode exposed by the SAE framework where model representations are entangled or collapse under intervention.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Can concept steering interventions on EEG foundation models be made selective rather than globally destructive? · question · 0.806 · Research question motivating the introduction of the probe area metric and identification of operational regimes.
- Load-bearing phrase describing catastrophic steering effects.
- Shows gating effect is specific to the self-referential computational regime, not a general feature effect
- Mechanistic interpretation of how activation steering induces deception through the model's reasoning process
- Central claim of the paper; supported by the model organism ground-truth approach.
- Addresses skeptical alternative that reports reflect only conversational content
- If agentic self-steering evaluation proves robust, it might be used to better explain and interpret SAE features in general. · hypothesis · 0.761 · Speculative claim about scaling introspective access to general SAE feature interpretation.
- Comparative claim between the two steering strategies
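
The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `cosine` and `similar_untyped`, the dictionary of embeddings, and the set-of-pairs edge representation are hypothetical and not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that meet the similarity threshold
    but have no typed edge to the target, sorted by descending score.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already share a typed edge
    """
    target = embeddings[target_id]
    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vector)
        if score >= threshold:
            candidates.append((entity_id, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data: two-dimensional vectors stand in for real embeddings.
embeddings = {
    "claim": [1.0, 0.0],
    "question": [0.9, 0.1],   # close to the claim, no typed edge
    "unrelated": [0.0, 1.0],  # orthogonal to the claim
    "finding": [1.0, 0.0],    # identical direction, but already linked
}
typed_edges = {frozenset(("claim", "finding"))}
related = similar_untyped("claim", embeddings, typed_edges)
```

In this toy data only `question` is returned: `finding` is excluded because it already has a typed edge, and `unrelated` falls below the threshold.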