paper
active
paper:xiao-aranguri-probe-data-attribution-2026
Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
/Users/antonborzov/Documents/Research.nosync/papers/xiao-aranguri-probe-data-attribution-2026.md
External IDs
arxiv
2602.11079
title_hash
0fd34d8d61d6a1e2aef02537310944cf3fedeb18
legacy_slug
xiao-aranguri-probe-data-attribution-2026
title_hash
149137685702c006ff1c1c13a4135e8d5ea9d157
Frontmatter (19 fields)
{
"doi": "10.48550/arxiv.2602.11079",
"pdf": "https://arxiv.org/pdf/2602.11079",
"url": "https://arxiv.org/abs/2602.11079",
"tags": [
"data-attribution",
"probes",
"post-training",
"DPO",
"alignment",
"applied-research",
"goodfire"
],
"year": 2026,
"saved": "2026-05-14",
"title": "Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training",
"venue": "arXiv preprint",
"status": "full-text-saved",
"authors": [
"Frank Xiao",
"Santiago Aranguri"
],
"landing": "https://www.goodfire.ai/research/probe-based-data-attribution",
"arxiv_id": 2602.11079,
"published": "2026-04-29",
"affiliation": "Caltech (SPAR) + Goodfire",
"openalex_id": "W7128690423",
"openalex_year": 2026,
"openalex_enriched_at": 1778977721,
"openalex_match_title": "Probe-Based Data Attribution: Discovering and Mitigating Undesirable Behaviors in LLM Post-Training",
"openalex_cited_by_count": 0
}
Outgoing (5)
Cites (3)
- Data Attribution (concept)
- Distractor-Triggered Compliance (concept)
- OLMo 2 (concept)
Implements (1)
Member of (1)
- Neural Steering Methods (community)
Incoming (5)
Associated with (2)
Authored by (2)
- Frank Xiao (thinker)
- Santiago Aranguri (thinker)
Implemented by (1)
- Goodfire AI research collective (concept)
References (29)
- Datamodels: Understanding predictions with data and data with predictions (referenced-only)
- Jailbroken: How does LLM safety training fail? (referenced-only)
- Refusal in language models is mediated by a single direction (referenced-only)
- Training verifiers to solve math word problems (referenced-only)
- Toy models of superposition (referenced-only)
- Constitutional AI: Harmlessness from AI feedback (referenced-only)
- Steering language models with activation engineering (referenced-only)
- LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset (referenced-only)
- Representation engineering: A top-down approach to AI transparency (referenced-only)
- Instruction-following evaluation for large language models (referenced-only)
- Sleeper agents: Training deceptive LLMs that persist through safety training (referenced-only)
- OLMo 2 furious (referenced-only)
- Persona vectors: Monitoring and controlling character traits in language models (referenced-only)
- Narrow finetuning leaves clearly readable traces in activation differences (referenced-only)
- Olmo 3 (referenced-only)
- Discovering language model behaviors with model-written evaluations (referenced-only)
- Estimating training data influence by tracing gradient descent (referenced-only)
- Hierarchical grouping to optimize an objective function (referenced-only)
- A coefficient of agreement for nominal scales (referenced-only)
- LoRA: Low-rank adaptation of large language models (referenced-only)
- The linear representation hypothesis and the geometry of large language models (referenced-only)
- Training language models to follow instructions with human feedback (referenced-only)
- LESS: Selecting influential data for targeted instruction tuning (referenced-only)
- TRAK: Attributing model behavior at scale (referenced-only)
- XSTest: A test suite for identifying exaggerated safety behaviours in large language models (referenced-only)
- Direct preference optimization: Your language model is secretly a reward model (referenced-only)
- AI sandbagging: Language models can strategically underperform on evaluations (referenced-only)
- OLMo: Open Language Model (referenced-only)
- Understanding black-box predictions via influence functions (referenced-only)
Mentions (2)
- papers
/Users/antonborzov/Documents/Research.nosync/papers/xiao-aranguri-probe-data-attribution-2026.md
- papers
xiao-aranguri-probe-data-attribution-2026.md