question

active

question:can-clinical-concepts-be-selectively-steered-without-damaging-unrelated-performance

Can clinical concepts be selectively steered without damaging unrelated performance?

Question about the feasibility of safe concept steering in EEG models.

Source paper

extracted_from

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

(2026) · William Lehn-Schiøler · Magnus Ruud Kjær · Rahul Thapa · M. Pedersen +9

Neighborhood — ranked by edge-count

Findings (2)

finding

Age-pathology confounding was observed: neither concept could be suppressed without corrupting the other.
answered_by
Empirical demonstration of entanglement between age and pathology features.
Interventions on some concepts act as 'wrecking balls', collapsing global model performance.
answered_by
Observation of catastrophic performance drop when steering certain concepts.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
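A minimal sketch of how such a neighborhood could be selected, assuming each entity has an embedding vector and typed edges are stored as pairs of entity names. The function names, the toy embeddings, and the edge representation are hypothetical illustrations, not this system's actual implementation; only the 0.65 cosine threshold and the "no typed edge" condition come from the description above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities similar to `target`.

    An entity qualifies when its cosine similarity to `target` is at least
    `threshold` and no typed edge links it to `target` in either direction.
    Results are sorted by descending similarity.
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue  # never list the entity as its own neighbor
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue  # already connected by a typed relation
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): "c" is close to "q" but already has a typed edge.
embeddings = {
    "q": [1.0, 0.0],
    "a": [0.8, 0.6],
    "b": [0.0, 1.0],
    "c": [1.0, 0.1],
}
typed_edges = {("q", "c")}

for name, score in related_by_similarity("q", embeddings, typed_edges):
    print(f"{name}\t{score:.3f}")
```

In this toy example only "a" is listed: "b" falls below the threshold, and "c" is excluded because it already has a typed edge to "q".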

Can concept steering interventions on EEG foundation models be made selective rather than globally destructive? · question · 0.799
Research question motivating the introduction of the probe area metric and identification of operational regimes.
Concept steering experiments identify three distinct operational regimes across clinical concepts in EEG foundation models. · finding · 0.766
Main empirical finding of the concept steering analysis.
If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable. · quote · 0.762
Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics.
How are clinical concepts represented and steerable in EEG foundation models? · question · 0.759
Core research question driving the mechanistic investigation.
Two concepts should not have the same purpose; additional concepts for the same purpose create needless complexity. · claim · 0.758
No redundancy criterion.
Models are not merely tracking dialogue context features; same-concept steering shows privileged internal access is necessary to explain self-report shifts. · claim · 0.754
Addresses skeptical alternative that reports reflect only conversational content.
How can we be sure that steering methods actually elicited the deployment behavior, as opposed to only suppressing verbalizations of being deployed? · question · 0.754
Central motivating question of the paper; the model organism approach is the proposed answer.
There may exist a global introspective faculty or steering direction that improves introspection uniformly across all concepts. · hypothesis · 0.754
Framed as an open problem; current evidence only points to local pair-specific improvement.