claim
active
claim:the-target-vs-off-target-probe-area-metric-quantifies-steering-selectivity-and-distinguishes-selectively-steerable-from-entangled-interventions
The target vs. off-target probe area metric quantifies steering selectivity and distinguishes selectively steerable from entangled interventions.
Justification for the novel metric introduced in the paper
Source paper
extracted_from · (2026) · William Lehn-Schiøler · Magnus Ruud Kjær · Rahul Thapa · M. Pedersen +9
Neighborhood — ranked by edge-count
Methods (1)
method
- Metric introduced to quantify steering selectivity by comparing the area of target and off-target concept probes.
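The entry does not spell out how the probe areas are computed. As a minimal sketch under stated assumptions (not the paper's definition), one plausible reading is that each probe's score is recorded across a sweep of steering strengths, the absolute shift from the unsteered score is integrated over that sweep, and the target area is compared with the mean off-target area. The function names `probe_area` and `selectivity`, the trapezoid-rule integration, and the ratio form are all hypothetical choices made for illustration.

```python
def probe_area(strengths, scores):
    """Area under the absolute probe-score shift across a steering sweep.

    Assumption: `strengths` is sorted ascending, and the score at the
    strength closest to zero is treated as the unsteered baseline.
    """
    base_idx = min(range(len(strengths)), key=lambda i: abs(strengths[i]))
    shift = [abs(s - scores[base_idx]) for s in scores]
    area = 0.0
    for i in range(1, len(strengths)):
        # Trapezoid rule over adjacent steering strengths.
        area += 0.5 * (shift[i] + shift[i - 1]) * (strengths[i] - strengths[i - 1])
    return area


def selectivity(strengths, target_scores, off_target_scores):
    """Share of total probe area carried by the target probe (hypothetical ratio).

    Values near 1.0 suggest a selective intervention; values near 0.5 or
    below suggest off-target probes move as much as the target (entangled).
    """
    target = probe_area(strengths, target_scores)
    off_areas = [probe_area(strengths, s) for s in off_target_scores]
    off_mean = sum(off_areas) / len(off_areas) if off_areas else 0.0
    total = target + off_mean
    return target / total if total > 0 else 0.0


strengths = [0.0, 1.0, 2.0]
selective = selectivity(strengths, [0.0, 1.0, 2.0], [[0.0, 0.0, 0.0]])
entangled = selectivity(strengths, [0.0, 1.0, 2.0], [[0.0, 1.0, 2.0]])
```

Under these assumptions, a sweep in which only the target probe responds scores higher than one in which an off-target probe tracks the target, which is the distinction the claim attributes to the metric.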
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Result categorizing concept steerability into three distinct regimes.
- Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics.
- Evidence that improved introspection in focus→wellbeing arises from enriched internal state and report channels simultaneously.
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance.
- Core empirical claim comparing steering approaches on cyclic concepts.
- Conceptual framing: integrates mechanistic interpretability tools with alignment-focused data curation.
- Empirical demonstration on Llama-3.1-8B that steering along representation manifold aligns outputs with behavior manifold, whereas linear steering does not.
- Supported by the instruction discovery experiments comparing steering vs. embedding baselines.