paper
active
paper:natural
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
/Users/antonborzov/Documents/Research.nosync/papers/natural.md
External IDs
title_hash
89aac446e48a3cbd7e5636a7e6738502b3f0d716
legacy_slug
natural
Frontmatter (12 fields)
{
"doi": null,
"year": null,
"title": "Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations",
"authors": [],
"abstract": "We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer that maps an activation to a text description and an activation reconstructor that maps the description back to an activation. We jointly train these modules with reinforcement learning to reconstruct residual stream activations, and although optimized for reconstruction, the resulting explanations read as plausible interpretations of model internals. We apply NLAs to model auditing and demonstrate their utility in diagnosing safety-relevant behaviors and surfacing unverbalized model behaviors.",
"arxiv_id": null,
"enrichment": {
"is_stale": true
},
"pdf_status": "available",
"openalex_id": null,
"source_file": "natural.md",
"ingested_via": "ingest_one_url",
"fulltext_path": "/Users/antonborzov/Documents/Research.nosync/papers/natural.md"
}
Outgoing (6)
introduces (3)
mentions (3)
- Claude Haiku 3.5 (dataset)
- Claude Haiku 4.5 (dataset)
- Claude Opus 4.6 (dataset)
Incoming (0)
None.
Mentions (1)
- papers-typed
natural.md