AI sandbagging: Language models can strategically underperform on evaluations

External IDs

title_hash

8b5b479ccf482db75f7adeef6879393a3b87c9f5

legacy_slug

t-ai-sandbagging-language-models-can-strat-2025

Frontmatter (8 fields)

{
  "doi": null,
  "year": 2025,
  "title": "AI sandbagging: Language models can strategically underperform on evaluations",
  "venue": "International Conference on Learning Representations",
  "authors": [
    "van der Weij, T.",
    "Hofstätter, F.",
    "Jaffe, O.",
    "Brown, S. F.",
    "Ward, F. R."
  ],
  "arxiv_id": null,
  "s2_paper_id": null,
  "ingest_status": "referenced-only"
}

Outgoing (0)

None.

Incoming (0)

None.

Cited by (1)

Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training