Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

/Users/antonborzov/Documents/Research.nosync/papers/natural.md

External IDs

title_hash

89aac446e48a3cbd7e5636a7e6738502b3f0d716

legacy_slug

natural

Frontmatter (12 fields)

{
  "doi": null,
  "year": null,
  "title": "Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations",
  "authors": [],
  "abstract": "We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer that maps an activation to a text description and an activation reconstructor that maps the description back to an activation. We jointly train these modules with reinforcement learning to reconstruct residual stream activations, and although optimized for reconstruction, the resulting explanations read as plausible interpretations of model internals. We apply NLAs to model auditing and demonstrate their utility in diagnosing safety-relevant behaviors and surfacing unverbalized model behaviors.",
  "arxiv_id": null,
  "enrichment": {
    "is_stale": true
  },
  "pdf_status": "available",
  "openalex_id": null,
  "source_file": "natural.md",
  "ingested_via": "ingest_one_url",
  "fulltext_path": "/Users/antonborzov/Documents/Research.nosync/papers/natural.md"
}

Outgoing (6)

introduces (3)

mentions (3)

Incoming (0)

Mentions (1)