Target vs. Off-Target Probe Area Metric

Metric introduced to quantify steering selectivity by comparing the area of target and off-target concept probes.
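The description above does not give the formula, so the following is only a minimal sketch of one plausible reading. It assumes each probe's response is recorded over a sweep of steering strengths, that "area" means the trapezoidal area under the absolute response curve, and that off-target probes are summarized by their mean area. The names `curve_area` and `probe_area_gap` are hypothetical and do not come from the paper.

```python
import numpy as np

def curve_area(strengths, probe_change):
    """Trapezoidal area under a probe-response curve (assumed definition of 'area')."""
    x = np.asarray(strengths, dtype=float)
    y = np.asarray(probe_change, dtype=float)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def probe_area_gap(strengths, target_change, off_target_changes):
    """Compare target and off-target probe areas.

    Assumption: off-target probes are summarized by their mean area, and
    selectivity is reported as target area minus off-target area.
    """
    target_area = curve_area(strengths, np.abs(target_change))
    off_areas = [curve_area(strengths, np.abs(c)) for c in off_target_changes]
    off_area = float(np.mean(off_areas))
    return target_area, off_area, target_area - off_area

# Illustrative numbers only, not results from the paper.
strengths = [0.0, 1.0, 2.0]
target, off, gap = probe_area_gap(strengths, [0.0, 1.0, 2.0], [[0.0, 0.5, 1.0]])
```

Under these assumptions, a large positive gap would suggest the intervention moves the target probe far more than the off-target probes, while a gap near zero would suggest entangled or ineffective steering.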

Neighborhood — ranked by edge-count

paper

method

Concept Steering
implements
Latent intervention technique that manipulates sparse features to steer model predictions toward desired concepts.

claim

event

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders (2026)
introduces
Preprint applying TopK SAEs to three EEG transformers to reveal sparse feature dictionaries, steering regimes, and spectral interpretation.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Concept steering with target vs off-target probe area metric reveals three operational regimes (selectively steerable, encoded but entangled, non-encoded) across SleepFM, REVE, LaBraM · finding · 0.793
Result categorizing concept steerability into three distinct regimes.
Off-target effects · concept · 0.737
Unintended changes in model behavior when performing edits; compared between VPD editing and fine-tuning.
A probe may achieve high performance even on representations that are not causally relevant for the task · claim · 0.737
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance.
Probe-Based Data Attribution · method · 0.736
Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
Probe-based data attribution for alignment · concept · 0.734
What is the appropriate metric for measuring representational alignment, given active debate on merits and deficiencies of all proposed measures? · question · 0.726
Open methodological question acknowledged as a limitation.
Probe score · concept · 0.724
Dot product between the hidden state and the concept vector, averaged across a 5-layer window around the best layer; measures the model's internal emotive state.
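The probe score described above can be sketched as follows. This is an illustration under stated assumptions, not the source's implementation: it assumes one hidden-state vector per layer, a single concept vector shared across layers, a window centered on the best layer, and clipping of the window at the first and last layers. The function name `probe_score` and its parameters are hypothetical.

```python
import numpy as np

def probe_score(hidden_states, concept_vector, best_layer, window=5):
    """Mean dot product of per-layer hidden states with a concept vector.

    hidden_states:  array of shape (num_layers, d_model), one vector per layer (assumed)
    concept_vector: array of shape (d_model,)
    best_layer:     index of the layer the window is centered on (assumed centering)
    window:         number of layers to average over (5 in the description)
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    concept_vector = np.asarray(concept_vector, dtype=float)
    half = window // 2
    lo = max(0, best_layer - half)                            # clip at the first layer
    hi = min(hidden_states.shape[0], best_layer + half + 1)   # clip at the last layer
    per_layer = hidden_states[lo:hi] @ concept_vector         # one dot product per layer
    return float(per_layer.mean())

# Toy input: layer i holds the vector [i, i, i]; the concept vector picks the first axis.
toy_states = np.arange(8)[:, None] * np.ones((8, 3))
toy_concept = np.array([1.0, 0.0, 0.0])
score = probe_score(toy_states, toy_concept, best_layer=4)
```

With the toy input, layers 2 through 6 fall inside the window, so the score is the mean of their first coordinates.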
Probes · concept · 0.723
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.