Natural Language Autoencoders for mechanistic interpretability

Using NLAs to extract human-readable explanations of model internals via unsupervised reconstruction, revealing steering vectors, confabulation patterns, and causal belief capture.

15 members.

Drawn from 1 source

The papers/notes whose extracted claims & findings make up this cluster.

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations (15 members)

Bridges (4)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

Mechanistic interpretability & model evaluation: 15 shared
Natural Language Auditing of Neural Models: 7 shared
Steganography detection via FVE probing: 3 shared
Sparse autoencoder interpretability limits: 1 shared

Findings (10)

Automated auditing benchmark requiring end-to-end investigation of an intentionally misaligned model; NLA-equipped agents outperform baselines. Downstream task validating NLA utility for model auditing; agents succeed without access to misalignment training data.
Editing NLA explanations to change 'reward' to 'penalty' produces a steering vector that increases odd-number responses from near zero to >70%, demonstrating belief capture upstream of behavior. Shows that NLA explanations capture latent model beliefs about rewards before output selection; validates interpretability.
Five prediction tasks improve with NLA training across three models (Opus 4.6, Haiku 4.5, Haiku 3.5). Systematic evidence that NLA explanations become more informative during training despite optimizing only for reconstruction.
Language switching is caused by malformed training data: the model fixates on spurious cues suggesting that the user is a non-native speaker, which NLA representations reveal before the foreign-language output appears. Case study demonstrating the ability of NLAs to surface root causes of model misbehavior; corroborated by training data inspection.
Little evidence of steganography between AV and AR; meaning-preserving transformations cause only small FVE drops. NLA explanations appear to encode information transparently in natural language rather than hidden channels.
Little evidence of steganography in NLAs; meaning-preserving transformations cause only small drops in FVE. Quantitative evaluation showing NLAs do not heavily rely on covert encoding beyond overt language.
Meaning-preserving transformations (paraphrasing, translating to French, shuffling) cause only small drops in FVE. Evidence that NLAs do not encode hidden information in overt text structure; explanations are primarily semantic.
NLA explanations grow more informative over training, with FVE increasing from 0.3-0.4 to 0.6-0.8 roughly linearly in log(training steps). Quantitative evidence that NLA training produces increasingly informative explanations despite optimizing only for reconstruction.
NLA-derived steering vectors from edited explanations can causally shift planning representations, changing rhyme completion from 'rabbit' to 'mouse' at ~50% success rate. Evidence that NLA explanations bear a causal relationship to model outputs; demonstrates validity of extracted representations.
NLA-equipped auditing agents outperform baselines on a misalignment investigation task. Demonstrates practical utility: NLAs enable root-cause discovery without access to the misaligned model's training data.
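
Several findings above are stated in terms of FVE and of steering vectors built from edited explanations. The sketch below illustrates both under stated assumptions, and is not the source's implementation. It assumes FVE means fraction of variance explained, computed as one minus the ratio of squared reconstruction error to total variance around the mean activation, and that a steering vector is the difference between the reconstructions of an edited and an original explanation. The `reconstruct` callable and all function names are hypothetical stand-ins for whatever maps an explanation back to an activation.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def fraction_of_variance_explained(
    originals: Sequence[Vector], reconstructions: Sequence[Vector]
) -> float:
    """Assumed definition: 1 - sum||x - x_hat||^2 / sum||x - mean(x)||^2."""
    if not originals or len(originals) != len(reconstructions):
        raise ValueError("need equally sized, non-empty batches")
    n, dim = len(originals), len(originals[0])
    # Per-dimension mean of the original activations.
    mean = [sum(x[d] for x in originals) / n for d in range(dim)]
    residual = sum(
        (x[d] - r[d]) ** 2
        for x, r in zip(originals, reconstructions)
        for d in range(dim)
    )
    total = sum((x[d] - mean[d]) ** 2 for x in originals for d in range(dim))
    if total == 0:
        raise ValueError("original activations have zero variance")
    return 1.0 - residual / total


def steering_vector(
    reconstruct: Callable[[str], Vector], original_text: str, edited_text: str
) -> List[float]:
    """Assumed construction: reconstruct(edited) - reconstruct(original)."""
    before = reconstruct(original_text)
    after = reconstruct(edited_text)
    return [a - b for a, b in zip(after, before)]


def apply_steering(
    activation: Vector, vector: Vector, scale: float = 1.0
) -> List[float]:
    """Add the scaled steering vector to an activation."""
    return [a + scale * v for a, v in zip(activation, vector)]


def toy_reconstruct(text: str) -> List[float]:
    # Toy stand-in for a reconstructor: counts two keywords. Purely illustrative.
    return [float(text.count("reward")), float(text.count("penalty"))]
```

With the toy reconstructor, editing "reward" to "penalty" in an explanation yields a vector that lowers the first coordinate and raises the second; a real reconstructor would return a vector in the target model's activation space. Under the assumed definition, the FVE drop from a meaning-preserving transformation would be the difference between FVE computed on the original and on the transformed explanations.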

Claims (5)

Natural Language Autoencoders achieve readable explanations through an unsupervised reconstruction loss optimized with reinforcement learning, not explicit interpretability constraints. Core insight: a reconstruction objective combined with appropriate initialization and KL regularization produces human-interpretable explanations as an emergent property.
NLA explanations can contain claims about the target model's input context that are verifiably false, but are typically thematically faithful to the context. Key limitation identified: NLAs hallucinate specific details while preserving thematic accuracy; informs practical usage.
NLA explanations confabulate false specifics but maintain thematic fidelity; claims repeated across tokens are more likely true than isolated claims. Core limitation and usage heuristic: read NLAs for themes rather than individual factual claims; cross-check with original context.
NLAs bridge unsupervised concept-discovery methods (e.g., SAEs) and supervised activation-verbalization methods (e.g., activation oracles). The paper positions NLAs as combining unsupervised learning with direct readability.
While NLA claims can be false in specifics, they are typically thematically faithful to the context. Key insight about confabulation patterns in NLAs enabling practical use.
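
The usage heuristic above (claims repeated across tokens are more likely true than isolated ones) can be sketched as a simple count. This is an illustrative sketch only: it assumes claims have already been extracted from each token's explanation as short strings, and the function names and the `min_tokens` threshold are hypothetical choices, not values from the source.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def claim_support(per_token_claims: Sequence[Sequence[str]]) -> Counter:
    """Count at how many token positions each (normalized) claim appears."""
    counts: Counter = Counter()
    for claims in per_token_claims:
        # A claim counts at most once per token position.
        counts.update({c.strip().lower() for c in claims})
    return counts


def split_by_repetition(
    per_token_claims: Sequence[Sequence[str]], min_tokens: int = 2
) -> Tuple[List[str], List[str]]:
    """Separate claims seen at >= min_tokens positions from isolated ones."""
    counts = claim_support(per_token_claims)
    repeated = sorted(c for c, n in counts.items() if n >= min_tokens)
    isolated = sorted(c for c, n in counts.items() if n < min_tokens)
    return repeated, isolated
```

Isolated claims are the ones to cross-check against the original context first; repetition raises confidence but does not verify a claim.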