claim

active

claim:some-sae-concept-steering-interventions-act-as-wrecking-balls-that-collapse-global-model-performance-rather-than-selectively-modifying-target-concepts

Some SAE concept steering interventions act as 'wrecking balls' that collapse global model performance rather than selectively modifying target concepts.

A critical failure mode identified in the paper, demonstrating the risk of naïve concept steering.

Source paper

extracted_from

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

(2026) · William Lehn-Schiøler · Magnus Ruud Kjær · Rahul Thapa · M. Pedersen +9

Neighborhood — ranked by edge-count

Findings (2)

finding

Concept interventions on some concepts act as 'wrecking-ball' interventions, collapsing global model performance.
supports
Observation of catastrophic performance drop when steering certain concepts.
Wrecking-ball interventions that collapse global model performance are empirically identified in EEG foundation models.
supports
Demonstrates a critical failure mode of concept steering, with clinical safety implications.

Communities (3)

community

Manifold-aware concept steering in neural representations
members_of
Explores geometry of activation/behavior manifolds to enable selective, non-destructive concept interventions.
Geometric concept representations in neural networks
members_of
Concepts encoded as curved manifolds and circular structures in LLM activation spaces.
Concept steering & catastrophic model failure
members_of
Studies how targeted interventions on learned concepts can cause sudden, global collapse in neural network performance.

Concepts (1)

concept

Representational Failure
supports
A failure mode exposed by the SAE framework, in which model representations are entangled or collapse under intervention.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Can concept steering interventions on EEG foundation models be made selective rather than globally destructive? · question · 0.806
Research question motivating the introduction of the probe area metric and identification of operational regimes.
wrecking-ball interventions that collapse global model performance · quote · 0.800
Load-bearing phrase describing catastrophic steering effects.
SAE feature steering in history, conceptual, and zero-shot control conditions produces zero experience reports under either suppression or amplification · finding · 0.791
Shows gating effect is specific to the self-referential computational regime, not a general feature effect.
Under steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas triggering repetitive reasoning cycles · claim · 0.790
Mechanistic interpretation of how activation steering induces deception through the model's reasoning process.
Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness · claim · 0.773
Central claim of the paper; supported by the model organism ground-truth approach.
Models are not merely tracking dialogue context features; same-concept steering shows privileged internal access is necessary to explain self-report shifts · claim · 0.764
Addresses skeptical alternative that reports reflect only conversational content.
If agentic self-steering evaluation proves robust, it might be used to better explain and interpret SAE features in general · hypothesis · 0.761
Speculative claim about scaling introspective access to general SAE feature interpretation.
Stepwise steering preserves accuracy while reducing cost, whereas all-token steering causes significant degradation at large intervention strengths · claim · 0.757
Comparative claim between the two steering strategies.
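
The selection rule for the similarity list (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the function names, the toy two-dimensional embeddings, and the choice to treat typed edges as undirected pairs are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target
    (cosine >= threshold) but share no typed edge with it, best match first.

    Assumption: typed_edges is a set of frozenset({a, b}) pairs, i.e. edges
    are treated as undirected for the purpose of this filter.
    """
    target_vec = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # an entity is not its own neighbor
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((other_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: ids and vectors are made up for illustration.
embeddings = {
    "claim": [1.0, 0.0],
    "question": [0.8, 0.6],   # similar, no typed edge: kept
    "finding": [1.0, 0.0],    # identical direction, but has a typed edge: skipped
    "unrelated": [0.0, 1.0],  # orthogonal: below threshold
}
typed_edges = {frozenset(("claim", "finding"))}

related = related_by_similarity("claim", embeddings, typed_edges)
```

Entries that survive this filter are the candidates for new edges or unrecognized duplicates; lowering `threshold` would widen the neighborhood at the cost of more loosely related entries.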