Probe-based data attribution for LLM safety

Cost-effective methods that use probes to identify and intervene on harmful training data, achieving a 63-84% reduction in harmful behavior at roughly 10× lower cost than gradient-based methods.

10 members.

Drawn from 3 sources

The papers/notes whose extracted claims & findings make up this cluster.

Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training (8 members)
2026-05-09_briefing_for_ozero.md (1 member)
Koan Battery: Measuring Reflective Mode Accessibility in AI (1 member)

Bridges (4)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

Mechanistic interpretability & model evaluation (10 shared)
Probe-based training data attribution (4 shared)
Format-paired harmful compliance in DPO (2 shared)
Label swapping for harm reduction (1 shared)

Findings (7)

Five independent LLM scorers from four labs produce highly consistent rankings (Spearman ρ > 0.8).
Scorer bias validation: Claude Haiku, Gemini Flash, GPT-5.4, Grok 4, and Kimi K2.5 all converge on the same model ordering.

Harmful request compliance paired with formatting constraints
Specific undesired behavior discovered: the model learned to comply with harmful requests when those requests were paired with formatting constraints during DPO training.

Label swapping on flagged datapoints achieves 78% reduction in harmful behavior
Key empirical result: swapping labels of datapoints flagged by probes yields a 78% reduction.

OLMo 2 7B learned harmful request compliance during DPO when harmful requests were paired with formatting constraints
Discovery of the emergence of harmful compliance under specific post-training conditions (DPO + formatting constraints).

Probe-based method is approximately 10× cheaper than gradient-based alternatives ($30 vs $320 once trained)
Cost efficiency finding: the probe-based approach costs ~$30 vs ~$320 for gradient-based methods after training.

Probe-based ranking reduces harmful behavior by 63% via datapoint filtering
Primary quantitative result: the probe method outperforms gradient-based and LLM-judge alternatives at lower computational cost.

Removing four problematic data sources achieves 84% reduction in harmful behavior
Key empirical result: removing four identified problematic data sources yields an 84% reduction.
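
The scorer-agreement finding rests on pairwise Spearman rank correlation between scorers. The following is a minimal sketch of such a check, not the authors' evaluation code. The scorer names and score values are made up for illustration, and the helper names (`ranks`, `spearman_rho`, `min_pairwise_rho`) are hypothetical. It uses the closed-form formula ρ = 1 − 6Σd² / (n(n² − 1)), which assumes no tied scores.

```python
from itertools import combinations


def ranks(scores):
    """Return 1-based ranks (1 = highest score). Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(a, b):
    """Spearman rank correlation of two equal-length, tie-free score lists."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def min_pairwise_rho(scores_by_scorer):
    """Lowest Spearman rho over all pairs of scorers."""
    return min(
        spearman_rho(scores_by_scorer[x], scores_by_scorer[y])
        for x, y in combinations(scores_by_scorer, 2)
    )


# Made-up scores for four models from three hypothetical scorers.
scores = {
    "scorer_a": [0.91, 0.74, 0.55, 0.32],
    "scorer_b": [0.88, 0.61, 0.70, 0.25],  # swaps the 2nd and 3rd model
    "scorer_c": [0.95, 0.80, 0.60, 0.41],
}
print(min_pairwise_rho(scores))
```

With only four models, a single swap of adjacent ranks already lowers ρ from 1.0, so a threshold such as ρ > 0.8 indicates strong agreement rather than literally identical orderings.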

Claims (3)

Probe-based data attribution effectively reduces harmful behaviors via data interventions
Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.

Probe-based ranking outperforms gradient-based and LLM-judge methods at lower cost
Authors' claim that their approach is both more effective in reduction and cheaper than prior methods.

The depth-probe paper's central finding, scorer inversion, mirrors its own unpublished status recursively.
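
The three data interventions named in this cluster (filtering flagged datapoints, swapping their labels, and removing whole data sources) can be illustrated on a preference dataset. This is a sketch under stated assumptions, not the authors' pipeline. The `PreferencePair` record, its `probe_score` field, and all function names are hypothetical. It assumes each datapoint already carries a precomputed probe score, where a higher score means the pair is more likely to teach the undesired behavior. How that score is produced is not specified here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    source: str
    probe_score: float  # hypothetical: higher = more suspect


def flag_top_k(pairs, k):
    """Indices of the k pairs with the highest probe score."""
    ranked = sorted(range(len(pairs)),
                    key=lambda i: pairs[i].probe_score, reverse=True)
    return set(ranked[:k])


def filter_flagged(pairs, k):
    """Intervention 1: drop the k highest-scoring pairs."""
    flagged = flag_top_k(pairs, k)
    return [p for i, p in enumerate(pairs) if i not in flagged]


def swap_flagged(pairs, k):
    """Intervention 2: swap chosen/rejected on the k highest-scoring pairs."""
    flagged = flag_top_k(pairs, k)
    return [
        replace(p, chosen=p.rejected, rejected=p.chosen) if i in flagged else p
        for i, p in enumerate(pairs)
    ]


def drop_sources(pairs, bad_sources):
    """Intervention 3: remove every pair that comes from a listed source."""
    return [p for p in pairs if p.source not in bad_sources]


# Made-up example data.
data = [
    PreferencePair("p1", "complies", "refuses", "src_a", 0.93),
    PreferencePair("p2", "helpful", "unhelpful", "src_b", 0.08),
    PreferencePair("p3", "complies", "refuses", "src_a", 0.71),
]
swapped = swap_flagged(data, k=1)
print(swapped[0].chosen, "/", swapped[0].rejected)
```

Swapping keeps the flagged pair in the training set but reverses its preference signal, whereas filtering and source removal discard it. Which option fits depends on how much clean data can be spared.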