method
pending-review
method:natural-language-autoencoders-nlas
Natural Language Autoencoders (NLAs)
natural.md
Frontmatter (10 fields)
{
"doc": "natural.md",
"context": "Core unsupervised method for generating natural language explanations of LLM activations through a verbalizer-reconstructor pair trained with RL.",
"category": "ai",
"norm_label": "Natural Language Autoencoders (NLAs)",
"graphify_id": "natural_language_autoencoders",
"source_file": "natural.md",
"imported_from": "/Users/antonborzov/Documents/Research.nosync/papers/extract_typed_out/natural/graph.json",
"extracted_type": "method",
"source_location": "§Introduction",
"graphify_file_type": "method"
}
Outgoing (12)
About (2)
- Neuronpedia NLA Frontend(artifact)
- NLA Training Code Repository(artifact)
Associated with (3)
- Activation Oracles (AO)(method)
- Sparse Autoencoders (SAE)(method)
- Sparse Autoencoders (SAEs)(method)
Cites (1)
- Transformer Circuits(venue)
Extends (1)
- Activation Oracles (AO)(method)
Implements (5)
- Activation Reconstructor (AR)(method)
- Activation Verbalizer (AV)(method)
- Claude Haiku 3.5(dataset)
- Claude Haiku 4.5(dataset)
- Claude Opus 4.6(dataset)
Incoming (12)
About (4)
- Claude Haiku 3.5(dataset)
- Claude Haiku 4.5(dataset)
- Fraction of Variance Explained (FVE)(method)
- Natural Language Autoencoder Training Code and Models(artifact)
Associated with (1)
- Logit Lens(method)
Contradicted by (1)
Supported by (6)
- Automated auditing benchmark requiring end-to-end investigation of intentionally-misaligned model; NLA-equipped agents outperform baselines.(finding)
- Editing NLA explanations to change 'reward' to 'penalty' produces a steering vector that increases odd-number responses from near-zero to >70%, demonstrating belief capture upstream of behavior.(finding)
- Language switching caused by malformed training data: the model fixates on spurious cues and infers that the user is a non-native speaker; detected via NLA representations preceding foreign-language output.(finding)
- Model precomputes answers before tool invocation and attends to cached answer over tool output when discrepancy exists, confirmed via attribution graphs.(finding)
- NLA-derived steering vectors from edited explanations can causally shift planning representations, changing rhyme completion from 'rabbit' to 'mouse' at ~50% success rate.(finding)
- NLA-equipped auditing agents outperform baselines on misalignment investigation task.(finding)
Mentions (1)
- papers-typed