Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training

/Users/antonborzov/Documents/Research.nosync/papers/xiao-aranguri-probe-data-attribution-2026.md

External IDs

arxiv

2602.11079

title_hash

0fd34d8d61d6a1e2aef02537310944cf3fedeb18

legacy_slug

xiao-aranguri-probe-data-attribution-2026

doi

10.48550/arxiv.2602.11079

title_hash

149137685702c006ff1c1c13a4135e8d5ea9d157

Frontmatter (19 fields)

{
  "doi": "10.48550/arxiv.2602.11079",
  "pdf": "https://arxiv.org/pdf/2602.11079",
  "url": "https://arxiv.org/abs/2602.11079",
  "tags": [
    "data-attribution",
    "probes",
    "post-training",
    "DPO",
    "alignment",
    "applied-research",
    "goodfire"
  ],
  "year": 2026,
  "saved": "2026-05-14",
  "title": "Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training",
  "venue": "arXiv preprint",
  "status": "full-text-saved",
  "authors": [
    "Frank Xiao",
    "Santiago Aranguri"
  ],
  "landing": "https://www.goodfire.ai/research/probe-based-data-attribution",
  "arxiv_id": "2602.11079",
  "published": "2026-04-29",
  "affiliation": "Caltech (SPAR) + Goodfire",
  "openalex_id": "W7128690423",
  "openalex_year": 2026,
  "openalex_enriched_at": 1778977721,
  "openalex_match_title": "Probe-Based Data Attribution: Discovering and Mitigating Undesirable Behaviors in LLM Post-Training",
  "openalex_cited_by_count": 0
}

Outgoing (5)

Incoming (5)

Associated with (2)

Authored by (2)

Frank Xiao (thinker)
Santiago Aranguri (thinker)

Implemented by (1)

Goodfire AI research collective (concept)

References (29)

Datamodels: Understanding predictions with data and data with predictions
referenced-only
Jailbroken: How does LLM safety training fail?
referenced-only
Refusal in language models is mediated by a single direction
referenced-only
Training verifiers to solve math word problems
referenced-only
Toy models of superposition
referenced-only
Constitutional AI: Harmlessness from AI feedback
referenced-only
Steering language models with activation engineering
referenced-only
LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset
referenced-only
Representation engineering: A top-down approach to AI transparency
referenced-only
Instruction-following evaluation for large language models
referenced-only
Sleeper agents: Training deceptive LLMs that persist through safety training
referenced-only
OLMo 2 furious
referenced-only
Persona vectors: Monitoring and controlling character traits in language models
referenced-only
Narrow finetuning leaves clearly readable traces in activation differences
referenced-only
Olmo 3
referenced-only
Discovering language model behaviors with model-written evaluations
referenced-only
Estimating training data influence by tracing gradient descent
referenced-only
Hierarchical grouping to optimize an objective function
referenced-only
A coefficient of agreement for nominal scales
referenced-only
LoRA: Low-rank adaptation of large language models
referenced-only
The linear representation hypothesis and the geometry of large language models
referenced-only
Training language models to follow instructions with human feedback
referenced-only
LESS: Selecting influential data for targeted instruction tuning
referenced-only
TRAK: Attributing model behavior at scale
referenced-only
XSTest: A test suite for identifying exaggerated safety behaviours in large language models
referenced-only
Direct preference optimization: Your language model is secretly a reward model
referenced-only
AI sandbagging: Language models can strategically underperform on evaluations
referenced-only
OLMo: Open Language Model
referenced-only
Understanding black-box predictions via influence functions
referenced-only

Mentions (2)

papers
/Users/antonborzov/Documents/Research.nosync/papers/xiao-aranguri-probe-data-attribution-2026.md
papers
xiao-aranguri-probe-data-attribution-2026.md