paper
referenced-only
paper:arditi-refusal-in-language-models-is-mediated-b-2024
Refusal in language models is mediated by a single direction
External IDs
title_hash
9741c30e865d2031c07b5e8d4715ad3721f73224
legacy_slug
arditi-refusal-in-language-models-is-mediated-b-2024
Frontmatter (8 fields)
{
"doi": null,
"year": 2024,
"title": "Refusal in language models is mediated by a single direction",
"venue": "Advances in Neural Information Processing Systems",
"authors": [
"Andy Arditi",
"Oscar Obeso",
"Aaquib Syed",
"Daniel Paleka",
"Nina Panickssery",
"Wes Gurnee",
"Neel Nanda"
],
"arxiv_id": null,
"s2_paper_id": null,
"ingest_status": "referenced-only"
}
Outgoing (0)
None.
Incoming (0)
None.