paper
active
paper:natural

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

/Users/antonborzov/Documents/Research.nosync/papers/natural.md

External IDs

title_hash: 89aac446e48a3cbd7e5636a7e6738502b3f0d716
legacy_slug: natural
Frontmatter (12 fields)
{
  "doi": null,
  "year": null,
  "title": "Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations",
  "authors": [],
  "abstract": "We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer that maps an activation to a text description and an activation reconstructor that maps the description back to an activation. We jointly train these modules with reinforcement learning to reconstruct residual stream activations, and although optimized for reconstruction, the resulting explanations read as plausible interpretations of model internals. We apply NLAs to model auditing and demonstrate their utility in diagnosing safety-relevant behaviors and surfacing unverbalized model behaviors.",
  "arxiv_id": null,
  "enrichment": {
    "is_stale": true
  },
  "pdf_status": "available",
  "openalex_id": null,
  "source_file": "natural.md",
  "ingested_via": "ingest_one_url",
  "fulltext_path": "/Users/antonborzov/Documents/Research.nosync/papers/natural.md"
}

Mentions (1)

  • papers-typed
    natural.md