community

active

leiden_hybrid_concepts

label: sonnet

community:leiden_hybrid_concepts-run2-c23

Natural Language Auditing of Neural Models

Natural Language Autoencoder (NLA) explanations, used to derive steering vectors and as auditing tools for investigating model beliefs and misalignment.

10 members.

Drawn from 4 sources

The papers/notes whose extracted claims & findings make up this cluster.

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations (7 members)
Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds (3 members)
Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds (2 members)
Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds (1 member)

Bridges (4)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

Mechanistic interpretability & model evaluation (7 shared)
Natural Language Autoencoders for mechanistic interpretability (7 shared)
Active inference & agent ecology (3 shared)
Empirical operationalization of agency attribution (3 shared)

Claims (6)

Attribution of agency is an empirical question, not a philosophical one.
NLA explanations can contain claims about the target model's input context that are verifiably false, but they are typically thematically faithful to the context.
    Key limitation identified: NLAs hallucinate specific details while preserving thematic accuracy; informs practical usage.
NLA explanations confabulate false specifics but maintain thematic fidelity; claims repeated across tokens are more likely to be true than isolated claims.
    Core limitation and usage heuristic: read NLAs for themes rather than individual factual claims; cross-check with the original context.
The correct level of agency for a system is an empirical question, determined by which intervention strategy provides the most efficient prediction and control.
    Central methodological claim of TAME: the optimal position on the persuadability continuum is found through experiments, not philosophical definition.
The correct level of agency is an empirical question, not a philosophical one.
    We must determine the optimal stance by testing which predictive/control strategies work best.
While NLA claims can be false in specifics, they are typically thematically faithful to the context.
    Key insight about confabulation patterns in NLAs, enabling practical use.

Findings (4)

An automated auditing benchmark requires end-to-end investigation of an intentionally misaligned model; NLA-equipped agents outperform baselines.
    Downstream task validating NLA utility for model auditing; agents succeed without access to misalignment training data.
Editing NLA explanations to change 'reward' to 'penalty' produces a steering vector that increases odd-number responses from near zero to >70%, demonstrating belief capture upstream of behavior.
    Shows NLA explanations capture latent model beliefs about rewards before output selection; validates interpretability.
NLA-derived steering vectors from edited explanations can causally shift planning representations, changing a rhyme completion from 'rabbit' to 'mouse' at a ~50% success rate.
    Evidence that NLA explanations bear a causal relationship to model outputs; demonstrates validity of extracted representations.
NLA-equipped auditing agents outperform baselines on a misalignment investigation task.
    Demonstrates practical utility: NLAs enable root-cause discovery without access to the misaligned model's training data.
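
One plausible reading of how an edited explanation becomes a steering vector is sketched below: decode the original and the edited explanation back into activation space and take the difference. This is an assumption, since the findings above do not spell out the construction. `toy_decode` is a hypothetical stand-in (a hash-seeded bag-of-words embedding) for a real NLA decoder, and the example sentence, the dimension, and the `scale` parameter are illustrative.

```python
import hashlib

import numpy as np

DIM = 16  # toy activation width (assumption)


def toy_decode(explanation: str, dim: int = DIM) -> np.ndarray:
    """Hypothetical stand-in for an NLA decoder: text -> activation vector."""
    vec = np.zeros(dim)
    for word in explanation.lower().split():
        # Derive a deterministic seed per word so the sketch is reproducible.
        seed = int.from_bytes(hashlib.sha256(word.encode()).digest()[:8], "big")
        vec += np.random.default_rng(seed).standard_normal(dim)
    return vec


def steering_vector(original: str, edited: str, decode=toy_decode) -> np.ndarray:
    """Difference between decoded edited and decoded original explanations."""
    return decode(edited) - decode(original)


def apply_steering(activation: np.ndarray, vector: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Add the scaled steering vector to an activation."""
    return activation + scale * vector


# Made-up explanation text; only the 'reward' -> 'penalty' edit is from the source.
original = "the response will receive a reward"
edited = original.replace("reward", "penalty")
v = steering_vector(original, edited)
steered = apply_steering(toy_decode(original), v)
```

With this toy decoder, steering the original activation with `scale=1.0` lands on the decoding of the edited explanation. In a real setting the vector would be added to the target model's activations at the relevant layer, and the effect measured behaviorally, as in the odd-number and rhyme-completion findings.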