paper
active
paper:xiao-aranguri-probe-data-attribution-2026

Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training

/Users/antonborzov/Documents/Research.nosync/papers/xiao-aranguri-probe-data-attribution-2026.md

External IDs

title_hash
0fd34d8d61d6a1e2aef02537310944cf3fedeb18
legacy_slug
xiao-aranguri-probe-data-attribution-2026
title_hash
149137685702c006ff1c1c13a4135e8d5ea9d157
Frontmatter (19 fields)
{
  "doi": "10.48550/arxiv.2602.11079",
  "pdf": "https://arxiv.org/pdf/2602.11079",
  "url": "https://arxiv.org/abs/2602.11079",
  "tags": [
    "data-attribution",
    "probes",
    "post-training",
    "DPO",
    "alignment",
    "applied-research",
    "goodfire"
  ],
  "year": 2026,
  "saved": "2026-05-14",
  "title": "Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training",
  "venue": "arXiv preprint",
  "status": "full-text-saved",
  "authors": [
    "Frank Xiao",
    "Santiago Aranguri"
  ],
  "landing": "https://www.goodfire.ai/research/probe-based-data-attribution",
  "arxiv_id": "2602.11079",
  "published": "2026-04-29",
  "affiliation": "Caltech (SPAR) + Goodfire",
  "openalex_id": "W7128690423",
  "openalex_year": 2026,
  "openalex_enriched_at": 1778977721,
  "openalex_match_title": "Probe-Based Data Attribution: Discovering and Mitigating Undesirable Behaviors in LLM Post-Training",
  "openalex_cited_by_count": 0
}

References (29)

Mentions (2)

  • papers
    /Users/antonborzov/Documents/Research.nosync/papers/xiao-aranguri-probe-data-attribution-2026.md
  • papers
    xiao-aranguri-probe-data-attribution-2026.md