label: haiku
community:leiden_hybrid_concepts-run4-c0-c6
Natural Language Autoencoders for mechanistic interpretability
Using NLAs to extract human-readable explanations of model internals via unsupervised reconstruction, revealing steering vectors, confabulation patterns, and causal belief capture.
15 members.
Drawn from 1 source
The papers/notes whose extracted claims & findings make up this cluster.
Bridges (4)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
Findings (10)
- Automated auditing benchmark requiring end-to-end investigation of an intentionally misaligned model; NLA-equipped agents outperform baselines. Downstream task validating NLA utility for model auditing; agents succeed without access to the misalignment training data.
- Editing NLA explanations to change 'reward' to 'penalty' produces a steering vector that increases odd-number responses from near zero to >70%, demonstrating belief capture upstream of behavior. Shows that NLA explanations capture latent model beliefs about rewards before output selection; validates interpretability.
- Five prediction tasks improve with NLA training across three models (Opus 4.6, Haiku 4.5, Haiku 3.5). Systematic evidence that NLA explanations become more informative during training despite optimizing only for reconstruction.
- Language switching caused by malformed training data: the model fixates on spurious cues from which it infers that the user is a non-native speaker, detected via NLA representations preceding foreign-language output. Case study demonstrating the ability of NLAs to surface root causes of model misbehavior; corroborated by training data inspection.
- Little evidence of steganography between AV and AR; meaning-preserving transformations cause only small FVE drops. NLA explanations appear to encode information transparently in natural language rather than through hidden channels.
- Little evidence of steganography in NLAs; meaning-preserving transformations cause only small drops in FVE. Quantitative evaluation showing that NLAs do not rely heavily on covert encoding beyond overt language.
- Meaning-preserving transformations (paraphrasing, translating to French, shuffling) cause only small drops in FVE. Evidence that NLAs do not encode hidden information in overt text structure; explanations are primarily semantic.
- NLA explanations grow more informative over training, with FVE increasing from 0.3-0.4 to 0.6-0.8 roughly linearly in log(training steps). Quantitative evidence that NLA training produces increasingly informative explanations despite optimizing only for reconstruction.
- NLA-derived steering vectors from edited explanations can causally shift planning representations, changing a rhyme completion from 'rabbit' to 'mouse' at a ~50% success rate. Evidence that NLA explanations bear a causal relationship to model outputs; demonstrates the validity of the extracted representations.
- NLA-equipped auditing agents outperform baselines on the misalignment investigation task. Demonstrates practical utility: NLAs enable root-cause discovery without access to the misaligned model's training data.
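Two quantities recur in the findings above: FVE, and steering vectors built from edited explanations. The sketch below is illustrative only and rests on assumptions that this page does not confirm: that FVE means fraction of variance explained, computed as one minus the squared reconstruction error divided by the variance around the mean activation, and that a steering vector can be formed as the difference between the reconstruction of an edited explanation and the reconstruction of the original one. All function names are hypothetical, and the numbers are toy values rather than results from the source.

```python
from typing import List, Sequence

Vector = Sequence[float]


def fraction_of_variance_explained(
    originals: Sequence[Vector], reconstructions: Sequence[Vector]
) -> float:
    """Assumed FVE: 1 - (squared error / variance around the mean activation)."""
    if not originals or len(originals) != len(reconstructions):
        raise ValueError("need equally sized, non-empty batches")
    n, dim = len(originals), len(originals[0])
    # Per-dimension mean of the original activations.
    mean = [sum(x[d] for x in originals) / n for d in range(dim)]
    residual = sum(
        (a - b) ** 2
        for x, x_hat in zip(originals, reconstructions)
        for a, b in zip(x, x_hat)
    )
    total = sum((a - m) ** 2 for x in originals for a, m in zip(x, mean))
    if total == 0.0:
        raise ValueError("activations have zero variance; FVE is undefined")
    return 1.0 - residual / total


def fve_drop(
    originals: Sequence[Vector],
    recon_from_explanation: Sequence[Vector],
    recon_from_transformed: Sequence[Vector],
) -> float:
    """FVE lost when the explanation is paraphrased, translated, or shuffled."""
    return fraction_of_variance_explained(
        originals, recon_from_explanation
    ) - fraction_of_variance_explained(originals, recon_from_transformed)


def steering_vector(recon_original: Vector, recon_edited: Vector) -> List[float]:
    """Assumed construction: reconstruction(edited) - reconstruction(original)."""
    return [e - o for o, e in zip(recon_original, recon_edited)]


def apply_steering(
    activation: Vector, direction: Vector, scale: float = 1.0
) -> List[float]:
    """Add the scaled steering direction to an activation."""
    return [a + scale * d for a, d in zip(activation, direction)]


if __name__ == "__main__":
    acts = [[0.0, 0.0], [2.0, 2.0]]           # toy "original" activations
    recon = [[0.0, 0.0], [2.0, 1.0]]          # toy reconstruction
    print(fraction_of_variance_explained(acts, recon))  # 0.75
    print(fve_drop(acts, acts, recon))                  # 0.25
    direction = steering_vector([1.0, 0.0], [0.5, 2.0])
    print(apply_steering([1.0, 1.0], direction, scale=2.0))  # [0.0, 5.0]
```

Under this reading, a small `fve_drop` after a meaning-preserving transformation is what the steganography findings describe: the reconstruction depends mostly on what the explanation says, not on its exact surface form.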
Claims (5)
- Natural Language Autoencoders achieve readable explanations through an unsupervised reconstruction loss optimized with reinforcement learning, not through explicit interpretability constraints. Core insight: the reconstruction objective, combined with appropriate initialization and KL regularization, produces human-interpretable explanations as an emergent property.
- NLA explanations can contain claims about the target model's input context that are verifiably false, but they are typically thematically faithful to the context. Key limitation identified: NLAs hallucinate specific details while preserving thematic accuracy; informs practical usage.
- NLA explanations confabulate false specifics but maintain thematic fidelity; claims repeated across tokens are more likely to be true than isolated claims. Core limitation and usage heuristic: read NLAs for themes rather than individual factual claims; cross-check with the original context.
- NLAs bridge unsupervised concept-discovery methods (e.g., SAEs) and supervised activation-verbalization methods (e.g., activation oracles). The paper positions NLAs as combining unsupervised learning with direct readability.
- While NLA claims can be false in specifics, they are typically thematically faithful to the context. Key insight about confabulation patterns in NLAs that enables practical use.
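The usage heuristic above (claims repeated across tokens are more trustworthy than isolated ones) can be sketched as a simple recurrence count. This is a minimal illustration, assuming each token position's explanation has already been reduced to a list of short claim strings; that extraction step, the function names, and the `min_tokens` threshold are hypothetical and do not come from the source.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def claim_support(claims_per_token: Iterable[Sequence[str]]) -> Counter:
    """Count at how many token positions each (normalized) claim appears."""
    counts: Counter = Counter()
    for claims in claims_per_token:
        # Use a set so a claim repeated within one position counts once.
        counts.update({c.strip().lower() for c in claims})
    return counts


def split_by_recurrence(
    claims_per_token: Iterable[Sequence[str]], min_tokens: int = 2
) -> Tuple[List[str], List[str]]:
    """Separate claims seen at >= min_tokens positions from isolated ones."""
    counts = claim_support(claims_per_token)
    recurring = sorted(c for c, n in counts.items() if n >= min_tokens)
    isolated = sorted(c for c, n in counts.items() if n < min_tokens)
    return recurring, isolated


if __name__ == "__main__":
    # Toy per-token claims, invented for illustration only.
    per_token = [
        ["User is a student", "topic: rhyming poetry"],
        ["topic: rhyming poetry"],
        ["Topic: rhyming poetry ", "text is in French"],
    ]
    recurring, isolated = split_by_recurrence(per_token)
    print("read as themes:", recurring)
    print("cross-check against the context:", isolated)
```

Exact string matching after lowercasing is a deliberately crude stand-in for deciding whether two claims say the same thing. Recurrence is only a prioritization signal; isolated claims still need to be checked against the original context.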