method
active
method:target-vs-off-target-probe-area-metric
Target vs. Off-Target Probe Area Metric
Metric introduced to quantify steering selectivity by comparing the area of target and off-target concept probes.
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (1)
method
- Concept Steering · implements · Latent intervention technique that manipulates sparse features to steer model predictions toward desired concepts.
Claims (1)
claim
- Justification for the novel metric introduced in the paper
Events (1)
event
- Preprint applying TopK SAEs to three EEG transformers to reveal sparse feature dictionaries, steering regimes, and spectral interpretation.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Result categorizing concept steerability into three distinct regimes.
- Unintended changes in model behavior when performing edits; compared between VPD editing and fine-tuning.
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
- Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
- Open methodological question acknowledged as limitation
- Dot product between hidden state and concept vector averaged across 5-layer window around best layer; measures model's internal emotive state
- Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.