finding

active

finding:concept-steering-with-target-vs-off-target-probe-area-metric-reveals-three-operational-regimes-selectively-steerable-encoded-but-entangled-non-encoded-across-sleepfm-reve-labram

Concept steering, evaluated with a target vs. off-target probe area metric, reveals three operational regimes (selectively steerable, encoded but entangled, non-encoded) across SleepFM, REVE, and LaBraM.

Result categorizing concept steerability into three distinct regimes.

Source paper

extracted_from

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

(2026) · William Lehn-Schiøler · Magnus Ruud Kjær · Rahul Thapa · M. Pedersen +9

Neighborhood — ranked by edge-count

Claims (1)

claim

Clinical concepts in EEG foundation models fall into three operational regimes: selectively steerable, encoded but entangled, and non-encoded.
supports
Interpretive claim summarizing the spectrum of concept steerability discovered.

Communities (3)

community

Manifold-aware concept steering in neural representations
members_of
Explores geometry of activation/behavior manifolds to enable selective, non-destructive concept interventions.
Geometric concept representations in neural networks
members_of
Concepts encoded as curved manifolds and circular structures in LLM activation spaces.
Concept entanglement in biomedical foundation models
members_of
Investigates the inseparability of clinical concepts (age, pathology) in EEG transformers using SAE feature analysis and steering metrics across the SleepFM, REVE, and LaBraM architectures.

Questions (1)

question

How are clinical concepts represented and steerable in EEG foundation models?
answered_by
Core research question driving the mechanistic investigation.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The target vs. off-target probe area metric quantifies steering selectivity and distinguishes selectively steerable from entangled interventions. · claim · 0.886
Justification for the novel metric introduced in the paper
If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable. · quote · 0.812
Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
Concept steering experiments identify three distinct operational regimes across clinical concepts in EEG foundation models. · finding · 0.806
Main empirical finding of the concept steering analysis
Target vs. Off-Target Probe Area Metric · method · 0.793
Metric introduced to quantify steering selectivity by comparing the area of target and off-target concept probes.
Optimally steering model behavior requires isolating concept geometry and defining operators to navigate it. · claim · 0.780
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity. · claim · 0.775
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
Feature steering was effective in 5 out of 7 cases where few-shot probe steering vectors failed to produce meaningful behavior change. · finding · 0.774
Empirical comparison showing advantage of SAE features in low-data regime.
Monosemanticity and entanglement of SAE features were benchmarked for clinical taxonomy grounding across SleepFM, REVE, and LaBraM. · finding · 0.772
Quantitative assessment of feature quality using clinical concepts across models.