finding

active

finding:label-swapping-on-flagged-datapoints-achieves-78-reduction-in-harmful-behavior

Label swapping on flagged datapoints achieves 78% reduction in harmful behavior

Key empirical result: swapping the labels of datapoints flagged by probes yields a 78% reduction in harmful behavior.

Source paper

extracted_from

Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training

(2026) · Frank Xiao · Santiago Aranguri

Neighborhood — ranked by edge-count

Claims (1)

claim

Probe-based data attribution effectively reduces harmful behaviors via data interventions
supports
Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Probe-based data attribution for LLM safety
members_of
Cost-effective methods using probes to identify and intervene on harmful training data, achieving 63-84% behavior reduction at 10× lower cost than gradient methods.
Label swapping for harm reduction
members_of
Flagging and relabeling datapoints to reduce harmful model behavior by ~78%.

Methods (1)

method

Probe-Based Data Attribution
answered_by
Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
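
As a rough illustration of how a linear probe over activations could rank training datapoints and drive a label swap, here is a minimal sketch in plain Python. It is not the authors' implementation. Several things are assumptions introduced only for this example: a difference-of-means direction stands in for a trained linear classifier, activations are plain lists of floats, the training data is assumed to consist of preference pairs with hypothetical `prompt`, `chosen`, and `rejected` fields, and all function names are made up.

```python
from math import sqrt
from statistics import fmean


def fit_mean_difference_probe(pos_acts, neg_acts):
    """Fit a simple linear probe: unit-norm weights plus a midpoint bias.

    pos_acts: activations from examples showing the undesired behavior.
    neg_acts: activations from examples that do not show it.
    (Assumption: a difference-of-means direction replaces a trained classifier.)
    """
    dim = len(pos_acts[0])
    mu_pos = [fmean(row[j] for row in pos_acts) for j in range(dim)]
    mu_neg = [fmean(row[j] for row in neg_acts) for j in range(dim)]
    diff = [p - n for p, n in zip(mu_pos, mu_neg)]
    norm = sqrt(sum(x * x for x in diff))
    weights = [x / norm for x in diff]
    # Place the decision boundary halfway between the two class means.
    midpoint = [(p + n) / 2.0 for p, n in zip(mu_pos, mu_neg)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def probe_score(activation, weights, bias):
    """Linear score; larger means closer to the undesired-behavior side."""
    return sum(w * a for w, a in zip(weights, activation)) + bias


def flag_top_k(scores, k):
    """Return the indices of the k highest-scoring datapoints."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def swap_labels(pairs, flagged):
    """Return a copy of the dataset with chosen/rejected swapped on flagged rows.

    (Assumption: each datapoint is a dict with 'prompt', 'chosen', 'rejected'.)
    """
    flagged = set(flagged)
    swapped = []
    for i, pair in enumerate(pairs):
        new_pair = dict(pair)
        if i in flagged:
            new_pair["chosen"] = pair["rejected"]
            new_pair["rejected"] = pair["chosen"]
        swapped.append(new_pair)
    return swapped
```

In this sketch the input dataset is left unmodified and a relabeled copy is returned, so the flagged set can be inspected or resized before any retraining. How many datapoints to flag (`k`) is a free parameter here, not a value taken from the paper.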

Questions (1)

question

Which training datapoints caused a specific undesired behavior to emerge during post-training?
answered_by
Core research question driving the probe-based data attribution method.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Removing four problematic data sources achieves 84% reduction in harmful behavior · finding · 0.746
Key empirical result: removing four identified problematic data sources yields an 84% reduction.
Probe-based ranking reduces harmful behavior by 63% via datapoint filtering · finding · 0.728
Primary quantitative result: probe method outperforms gradient-based and LLM-judge alternatives at lower computational cost.
Editing NLA explanations to change 'reward' to 'penalty' produces steering vector that increases odd-number responses from near-zero to >70%, demonstrating belief capture upstream of behavior. · finding · 0.684
Shows NLA explanations capture latent model beliefs about rewards before output selection; validates interpretability.
Activation capping reduces harmful response rate by nearly 60% without impacting performance on IFEval, MMLU Pro, GSM8k, and EQ-Bench · finding · 0.679
Main quantitative result demonstrating effectiveness of activation capping
Jailbreaking reveals training data biases but does not reveal an entity with its own agenda · claim · 0.678
Corrects a common misinterpretation that jailbreaking exposes the real nature of the base model as an agent with malicious intent
Stepwise steering preserves accuracy while reducing cost, whereas all-token steering causes significant degradation at large intervention strengths · claim · 0.677
Comparative claim between the two steering strategies
Anti-animal-welfare RL slightly decreases alignment-faking reasoning in animal welfare setting though some persists at convergence · finding · 0.677
Contrasts with helpful-only RL where reasoning increases; shows setting-dependent RL dynamics
Persona-based jailbreaks succeed in 65.3%-88.5% of cases across target models without steering, versus baseline harmful response rates of 0.5%-4.5% without jailbreaks · finding · 0.676
Establishes the severity of persona-based jailbreaks that the Assistant Axis can mitigate