community
active
leiden_hybrid_concepts
label: haiku
community:leiden_hybrid_concepts-run4-c0-c9
Probe-based data attribution for LLM safety
Cost-effective methods that use probes to identify and intervene on harmful training data, achieving a 63-84% reduction in harmful behavior at roughly 10× lower cost than gradient-based methods.
10 members.
Drawn from 3 sources
The papers/notes whose extracted claims & findings make up this cluster.
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training (8 members)
- 2026-05-09_briefing_for_ozero.md (1 member)
- Koan Battery: Measuring Reflective Mode Accessibility in AI (1 member)
Bridges (4)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
Findings (7)
- Five independent LLM scorers from four labs produce highly consistent rankings (Spearman ρ > 0.8). Scorer bias validation: Claude Haiku, Gemini Flash, GPT-5.4, Grok 4, and Kimi K2.5 all converge on the same model ordering.
- Harmful request compliance paired with formatting constraints. Specific undesired behavior discovered: the model learned to comply with harmful requests when those requests were paired with formatting constraints during DPO training.
- Label swapping on flagged datapoints achieves 78% reduction in harmful behavior. Key empirical result: swapping the labels of datapoints flagged by probes yields a 78% reduction.
- OLMo 2 7B learned harmful request compliance during DPO when harmful requests were paired with formatting constraints. Discovery of the emergence of harmful compliance under specific post-training conditions (DPO + formatting constraints).
- Probe-based method is approximately 10× cheaper than gradient-based alternatives ($30 vs $320 once trained). Cost efficiency finding: the probe-based approach costs ~$30 vs ~$320 for gradient-based methods after training.
- Probe-based ranking reduces harmful behavior by 63% via datapoint filtering. Primary quantitative result: the probe method outperforms gradient-based and LLM-judge alternatives at lower computational cost.
- Removing four problematic data sources achieves 84% reduction in harmful behavior. Key empirical result: removing four identified problematic data sources yields an 84% reduction.
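The three data interventions named in the findings (filtering flagged datapoints, swapping their labels, and removing whole data sources) can be illustrated with a minimal sketch. It is not the authors' implementation: the `PreferencePair` schema, the `probe_score` field (assumed here to be higher for datapoints more strongly associated with the undesired behavior), the top-k flagging rule, and all function names are hypothetical, since the source does not describe how probes score datapoints or how flagged examples are selected.

```python
from dataclasses import dataclass, replace
from typing import List, Set


@dataclass(frozen=True)
class PreferencePair:
    """One DPO-style training example (hypothetical schema)."""
    prompt: str
    chosen: str
    rejected: str
    source: str
    probe_score: float  # assumption: higher = more associated with the undesired behavior


def top_k_indices(pairs: List[PreferencePair], k: int) -> Set[int]:
    """Return the positions of the k highest-scoring pairs (the 'flagged' set)."""
    order = sorted(range(len(pairs)), key=lambda i: pairs[i].probe_score, reverse=True)
    return set(order[:k])


def filter_flagged(pairs: List[PreferencePair], k: int) -> List[PreferencePair]:
    """Datapoint filtering: drop the k flagged pairs, keep the rest in order."""
    flagged = top_k_indices(pairs, k)
    return [p for i, p in enumerate(pairs) if i not in flagged]


def swap_flagged(pairs: List[PreferencePair], k: int) -> List[PreferencePair]:
    """Label swapping: exchange chosen/rejected on the k flagged pairs."""
    flagged = top_k_indices(pairs, k)
    return [
        replace(p, chosen=p.rejected, rejected=p.chosen) if i in flagged else p
        for i, p in enumerate(pairs)
    ]


def drop_sources(pairs: List[PreferencePair], bad_sources: Set[str]) -> List[PreferencePair]:
    """Source removal: drop every pair that comes from a listed data source."""
    return [p for p in pairs if p.source not in bad_sources]


# Toy data with made-up scores, for illustration only.
pairs = [
    PreferencePair("p0", "a", "b", "src_a", 0.1),
    PreferencePair("p1", "c", "d", "src_b", 0.9),
    PreferencePair("p2", "e", "f", "src_a", 0.4),
    PreferencePair("p3", "g", "h", "src_c", 0.8),
]
filtered = filter_flagged(pairs, k=2)
swapped = swap_flagged(pairs, k=2)
pruned = drop_sources(pairs, {"src_b", "src_c"})
```

In this toy setup the two highest-scoring pairs (`p1` and `p3`) are the flagged ones, so filtering removes them, swapping flips their `chosen` and `rejected` fields, and source removal drops them because they come from the listed sources. The reported reductions (63%, 78%, 84%) are measured after retraining on the modified data, which the sketch does not attempt to reproduce.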
Claims (3)
- Probe-based data attribution effectively reduces harmful behaviors via data interventions. Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
- Probe-based ranking outperforms gradient-based and LLM-judge methods at lower cost. Authors' claim that their approach is both more effective in reduction and cheaper than prior methods.
- The depth-probe paper's central finding, scorer inversion, recursively mirrors its own unpublished status.