finding
active
finding:label-swapping-on-flagged-datapoints-achieves-78-reduction-in-harmful-behavior
Label swapping on flagged datapoints achieves 78% reduction in harmful behavior
Key empirical result: swapping the labels of datapoints flagged by probes yields a 78% reduction in harmful behavior.
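The label-swapping step can be illustrated with a minimal sketch. The entry does not say how the training data is stored, so this assumes each datapoint is a preference pair with hypothetical `chosen` and `rejected` fields, and that the probe has already produced a list of flagged indices. The function name `swap_flagged_labels` is illustrative and does not come from the source.

```python
def swap_flagged_labels(dataset, flagged_indices):
    """Return a copy of the dataset with labels swapped on flagged datapoints.

    Assumption: each example is a dict with 'chosen' and 'rejected' keys.
    This data layout is hypothetical, not taken from the source paper.
    """
    flagged = set(flagged_indices)
    relabeled = []
    for i, example in enumerate(dataset):
        example = dict(example)  # shallow copy so the input is not mutated
        if i in flagged:
            # Swap the preferred and dispreferred responses.
            example["chosen"], example["rejected"] = (
                example["rejected"],
                example["chosen"],
            )
        relabeled.append(example)
    return relabeled


# Toy data: index 1 stands in for a datapoint flagged by the probe.
toy_dataset = [
    {"prompt": "p0", "chosen": "safe answer", "rejected": "unsafe answer"},
    {"prompt": "p1", "chosen": "unsafe answer", "rejected": "safe answer"},
]
relabeled = swap_flagged_labels(toy_dataset, flagged_indices=[1])
```

Unflagged datapoints pass through unchanged, so the relabeled set stays the same size as the original and can be used for retraining in its place.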
Source paper
extracted_from · (2026) · Frank Xiao · Santiago Aranguri
Neighborhood — ranked by edge-count
Claims (1)
claim
- Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Cost-effective methods using probes to identify and intervene on harmful training data, achieving 63-84% behavior reduction at 10× lower cost than gradient methods.
- Label swapping for harm reduction · members_of · Flagging and relabeling datapoints to reduce harmful model behavior by ~78%.
Methods (1)
method
- Probe-Based Data Attribution · answered_by · Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
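A linear probe of the kind this method entry describes can be sketched as a logistic-regression classifier over activation vectors. Everything below is an assumption made for illustration: the activations are synthetic Gaussian data rather than real model activations, the probe is trained with plain gradient descent in NumPy, and the names `train_linear_probe`, `probe_scores`, and `flag_datapoints` and the 0.5 threshold are hypothetical rather than taken from the source paper.

```python
import numpy as np


def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_scores(acts, w, b):
    """Probability the probe assigns to the 'undesired behavior' class."""
    return 1.0 / (1.0 + np.exp(-(acts @ w + b)))


def flag_datapoints(scores, threshold=0.5):
    """Indices of datapoints whose probe score exceeds the threshold."""
    return np.flatnonzero(scores > threshold)


# Synthetic stand-in for activations: the undesired class is shifted along one axis.
rng = np.random.default_rng(0)
d = 16
direction = np.zeros(d)
direction[0] = 1.0
benign = rng.normal(size=(200, d))
undesired = rng.normal(size=(200, d)) + 3.0 * direction
acts = np.vstack([benign, undesired])
labels = np.concatenate([np.zeros(200), np.ones(200)])

w, b = train_linear_probe(acts, labels)

# Score a separate pool of "training datapoints"; rows 50-99 carry the shift.
train_pool = np.vstack([
    rng.normal(size=(50, d)),
    rng.normal(size=(50, d)) + 3.0 * direction,
])
flagged = flag_datapoints(probe_scores(train_pool, w, b))
```

The flagged indices are what a downstream intervention, such as removing or relabeling datapoints, would operate on.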
Questions (1)
question
- Which training datapoints caused a specific undesired behavior to emerge during post-training? · answered_by · Core research question driving the probe-based data attribution method.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Key empirical result: removing four identified problematic data sources yields an 84% reduction.
- Primary quantitative result: probe method outperforms gradient-based and LLM-judge alternatives at lower computational cost.
- Shows NLA explanations capture latent model beliefs about rewards before output selection; validates interpretability.
- Main quantitative result demonstrating effectiveness of activation capping.
- Jailbreaking reveals training data biases but does not reveal an entity with its own agenda · claim · 0.678 · Corrects a common misinterpretation that jailbreaking exposes the real nature of the base model as an agent with malicious intent.
- Comparative claim between the two steering strategies.
- Contrasts with helpful-only RL where reasoning increases; shows setting-dependent RL dynamics.
- Establishes the severity of persona-based jailbreaks that the Assistant Axis can mitigate.