paper
referenced-only
paper:arditi-refusal-in-language-models-is-mediated-b-2024

Refusal in language models is mediated by a single direction

External IDs

title_hash: 9741c30e865d2031c07b5e8d4715ad3721f73224
legacy_slug: arditi-refusal-in-language-models-is-mediated-b-2024

Frontmatter (8 fields)
{
  "doi": null,
  "year": 2024,
  "title": "Refusal in language models is mediated by a single direction",
  "venue": "Advances in Neural Information Processing Systems",
  "authors": [
    "Andy Arditi",
    "Oscar Obeso",
    "Aaquib Syed",
    "Daniel Paleka",
    "Nina Panickssery",
    "Wes Gurnee",
    "Neel Nanda"
  ],
  "arxiv_id": null,
  "s2_paper_id": null,
  "ingest_status": "referenced-only"
}

Outgoing (0)

None.

Incoming (0)

None.