claim
active
claim:some-sae-concept-steering-interventions-act-as-wrecking-balls-that-collapse-global-model-performance-rather-than-selectively-modifying-target-concepts

Some SAE concept steering interventions act as 'wrecking balls' that collapse global model performance rather than selectively modifying target concepts.

A critical failure mode identified in the paper, demonstrating the risk of naïve concept steering.

Source paper

extracted_from
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
(2026) · William Lehn-Schiøler · Magnus Ruud Kjær · Rahul Thapa · M. Pedersen +9

Neighborhood — ranked by edge-count

Findings (2)

finding

Communities (3)

community

Concepts (1)

concept
  • A failure mode exposed by the SAE framework in which model representations are entangled or collapse under intervention.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.