claim

active

claim:the-target-vs-off-target-probe-area-metric-quantifies-steering-selectivity-and-distinguishes-selectively-steerable-from-entangled-interventions

The target vs. off-target probe area metric quantifies steering selectivity and distinguishes selectively steerable from entangled interventions.

Justification for the novel metric introduced in the paper

Source paper

extracted_from

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

(2026) · William Lehn-Schiøler · Magnus Ruud Kjær · Rahul Thapa · M. Pedersen +9

Neighborhood — ranked by edge-count

Methods (1)

method

Target vs. Off-Target Probe Area Metric
supports
Metric introduced to quantify steering selectivity by comparing the area of target and off-target concept probes.
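This page does not say how the area is computed, so the following is only a minimal Python sketch of one plausible reading. It assumes that each probe's output is recorded over a sweep of steering strengths, that "area" means the trapezoidal area under the absolute probe shift from the first sweep point, and that selectivity is the gap between the target area and the mean off-target area. The names `probe_area` and `selectivity_gap` and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np


def probe_area(alphas, probe_scores):
    """Area under the absolute probe shift across a steering sweep.

    Assumption: the shift is measured relative to the first sweep point,
    and the area uses the trapezoid rule (written out by hand here).
    """
    alphas = np.asarray(alphas, dtype=float)
    scores = np.asarray(probe_scores, dtype=float)
    shift = np.abs(scores - scores[0])
    return float(np.sum(0.5 * (shift[1:] + shift[:-1]) * np.diff(alphas)))


def selectivity_gap(alphas, target_scores, off_target_scores):
    """Return (target area, mean off-target area, their difference).

    Assumption: a large positive gap suggests selective steering, while a
    gap near zero suggests the target concept is entangled with others.
    """
    target_area = probe_area(alphas, target_scores)
    off_areas = [probe_area(alphas, s) for s in off_target_scores]
    mean_off = float(np.mean(off_areas))
    return target_area, mean_off, target_area - mean_off


# Toy sweep (made-up numbers): the target probe moves, off-target probes barely do.
alphas = [0.0, 1.0, 2.0]
target = [0.0, 0.5, 1.0]
off_targets = [[0.2, 0.2, 0.2], [0.1, 0.2, 0.3]]
target_area, mean_off_area, gap = selectivity_gap(alphas, target, off_targets)
print(target_area, mean_off_area, gap)
```

Under this reading, an intervention whose off-target area approaches its target area would fall on the entangled side, whatever exact threshold the paper uses.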

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Concept steering with target vs. off-target probe area metric reveals three operational regimes (selectively steerable, encoded but entangled, non-encoded) across SleepFM, REVE, LaBraM. · finding · 0.886
Result categorizing concept steerability into three distinct regimes.
If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable. · quote · 0.778
Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
Focus→wellbeing steering: both probe entropy (1.09→1.67 bits) and report entropy (0.88→1.69 bits) increase monotonically with α · finding · 0.778
Evidence that improved introspection in focus→wellbeing arises from enriched internal state and report channels simultaneously
A probe may achieve high performance even on representations that are not causally relevant for the task · claim · 0.775
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
Linear steering produces noisy off-target effects; manifold steering cleanly shifts probability mass between sequential concepts. · finding · 0.772
Core empirical claim comparing steering approaches on cyclic concepts.
Probe-based data attribution bridges interpretability work (probes/activations) with data-centric alignment. · claim · 0.771
Conceptual framing: integrates mechanistic interpretability tools with alignment-focused data curation.
Manifold steering produces clean probability shifts along natural behavior structure; linear steering cuts across the manifold and produces noisy off-target effects · finding · 0.769
Empirical demonstration on Llama-3.1-8B that steering along representation manifold aligns outputs with behavior manifold, whereas linear steering does not.
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity. · claim · 0.764
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.