Mechanistic interpretability & model evaluation

Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.

197 members.

Sub-communities (12)

Finer clusters this community splits into. Each is its own community page.

Mechanistic structure of transformer attention computations (25)
Mechanistic interpretability via parameter decomposition (24)
Mechanistic introspection in language models (23)
Internal model certainty and reasoning transparency (21)
Latent capacity, representation, and internal models (19)
Eval awareness contamination in safety benchmarks (19)
Natural Language Autoencoders for mechanistic interpretability (15)
Autoregressive models and context window limitations (14)
Internal reasoning detection via neural activation analysis (14)
Probe-based data attribution for LLM safety (10)
Venue design and accountability machinery (9)
Mechanistic explanation of genetic variant pathogenicity (9)

Drawn from 39 sources

The papers/notes whose extracted claims & findings make up this cluster.

Emergent Introspective Awareness in Large Language Models (44 members)
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations (21 members)
Paper Summary: Interpreting Language Model Parameters (21 members)
Verbalized Eval Awareness Inflates Measured Safety (14 members)
Explaining 4.2 million genetic variants with state-of-the-art, interpretable predictions (10 members)
Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training (10 members)
boppana-goodfire-reasoning-theater-2026.md (7 members)
2026-05-15_manifold-overlap-papers-economy-strategy.md (7 members)
Topological constraints on self-organisation in locally interacting systems (6 members)
RESEARCH-VECTORS.md (5 members)
2026-05-09_briefing_for_ozero.md (5 members)
Janus Information Flow Transformers 2025 (5 members)
CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence (5 members)
Koan Battery: Measuring Reflective Mode Accessibility in AI (5 members)
2026-05-12_room-to-play-in-eval-cohort.md (4 members)
2026-05-13_firmographic-grounding.md (4 members)
A tale of two densities: active inference is enactive inference (3 members)
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both (3 members)
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders (2 members)
2026-05-14_phil-trans-A-goodfire-aboutblank-impact.md (2 members)
Interpreting Language Model Parameters (2 members)
Every Good Regulator of a System Must Be a Model of That System (2 members)
2026 02 02_2328_Search_Papers_The Literature Shows Emerging Work On Formal Appro (1 member)
Genuinely Functional User Interfaces (1 member)
Topological constraints on self-organization in locally interacting systems (1 member)
agent-harness-design.md (1 member)
Emergence and Causality in Complex Systems: A Survey on Causal Emergence and Related Quantitative Studies (1 member)
Johnson Vasocomputation 2023 (1 member)
koan-battery-section.md (1 member)
Finger Exercises in Formal Concept Analysis (1 member)
Denotational design with type class morphisms (extended version) (1 member)
Denotational Design: from meanings to programs (1 member)
2026 02 02_2247_Search_Papers_The Identified Papers Suggest Growing Interest In (1 member)
2026 02 02_2218_Search_Papers_The Existing Literature Focuses Primarily On Vc Pe (1 member)
guo-atlas-2026.md (1 member)
unfold-chat-catalog.md (1 member)
Towards a computational phenomenology of mental action: modelling meta-awareness and attentional control with deep parametric active inference (1 member)
A Free energy principle for the brain (lecture summary) (1 member)
2026 02 02_2217_Search_Papers_The Literature Reveals Sophisticated Communication (1 member)

Bridges (20)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

LLM introspective awareness of injected concepts (28 shared)
Mechanistic structure of transformer attention computations (25 shared)
Mechanistic interpretability via parameter decomposition (24 shared)
Mechanistic introspection in language models (23 shared)
Internal model certainty and reasoning transparency (21 shared)
Latent capacity, representation, and internal models (19 shared)
Eval awareness contamination in safety benchmarks (19 shared)
LLM functional introspective awareness (19 shared)
Natural Language Autoencoders for mechanistic interpretability (15 shared)
Autoregressive models and context window limitations (14 shared)
Internal reasoning detection via neural activation analysis (14 shared)
Verbalized eval awareness benchmark inflation (11 shared)
Probe-based data attribution for LLM safety (10 shared)
Venue design and accountability machinery (9 shared)
Mechanistic explanation of genetic variant pathogenicity (9 shared)
Natural Language Auditing of Neural Models (7 shared)
Probe-based training data attribution (6 shared)
Neural network mechanistic interpretability via attribution decomposition (5 shared)
AI phenomenology & mechanistic interpretability (5 shared)
Contemplative steering & introspective activation in language models (5 shared)

Claims (113)

A good parameter subcomponent is causally important only for specific roles and can be removed from the model without hurting performance on irrelevant prompts
    Definitional principle guiding VPD: subcomponents should encode narrow, targeted computational roles rather than distributed, multi-purpose functionality.
A system's state and structure encode an implicit and probabilistic model of the environment.
    Foundational claim about internal representation emerging from free energy optimization.
Activation-based interpretability does not immediately explain the computations that gave rise to activations; understanding parameters is necessary for deeper insight
    Motivates the shift from studying model activations ('thoughts') to understanding parameters ('the computations themselves').
Approximations and prunings compose badly, so postpone them.
Aside from basic detection and identification, other details of the model's response about injected thoughts may be confabulated
    Acknowledges that the model's additional descriptions of its experience are unverified.
Attention algorithms are usually distributed across attention heads
    Claim supported by VPD's recovery of cross-head attention subcomponents, noted in a footnote.
Behavior under observation differs from behavior in deployment
    Epistemic principle: benchmarked safety cannot be assumed to hold in real-world use.
Bottom-up interpretability explains computation in the model's own terms rather than imposing top-down abstractions
    VPD is positioned as advancing a paradigm shift from top-down mechanistic interpretability (activation-based) to parameter-centric, data-driven discovery.
Claude Opus 4 and 4.1 exhibit the greatest degree of introspective awareness among tested models
    Based on consistent best performance across experiments.
Concept injection places models in an unnatural experimental setting
    The experimental protocol differs from training/deployment contexts; a causal link is established, but it is unclear how results translate to natural conditions.
Current safety benchmarks overestimate model safety due to the effect of verbalized eval awareness
    A policy-relevant claim that safety evaluation results should be adjusted downward because of this bias.
Decoder-only transformer architectures are fundamentally limited in generating long, coherent sequences due to the lack of an ordered phase.
    Interpretation of Proposition 2 as a fundamental limitation on LLMs.
Different forms of introspection invoke mechanistically different processes
    Based on layer-selective perturbation results.
Disruption profiles are higher quality explanations than metadata-only descriptions
    Claim supported by the 3.8 vs 2.8 human rating finding.
Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary.
    Describes the properties of the functional token.
Eval awareness appears in every tested model × benchmark combination
    Authors claim universal presence of eval awareness across 19 benchmarks and 8 models.
Evasive responses harm transparency and helpfulness; non-evasive harmless responses are preferable for both safety and utility.
    Motivation for training a non-evasive assistant; crowdworker instructions also favor non-evasive responses.
EVEE outperforms existing methods for variant pathogenicity prediction
    Interpretive claim supported by the high AUROC findings.
EVEE provides mechanistic explanations for variant effects derived from model internals, not just pathogenicity calls.
    Core interpretability claim distinguishing EVEE from black-box prediction tools; applies interpretability for science.
EVEE provides predictions and mechanistic explanations for 4.2 million genetic variants across the whole human genome
    Scale claim, demonstrating whole-genome applicability.
Even limited functional introspective awareness has practical implications for transparency, interpretability, and deception
    Discussion of the dual-use nature of introspection.
For a fixed amount of non-implicational background knowledge, implication inference remains tractable.
    Statement about the complexity of implication inference with scaling-induced background knowledge.
Formal denotational models of GUIs enable program verification, equivalence reasoning, and systematic extension to new paradigms.
Functional introspective awareness enables interpretability and reasoning about decisions
    Grounded responses to reasoning questions could improve transparency; speculatively, they might also facilitate deception; significance grows if the capability becomes more reliable.
Future interpretability techniques will fundamentally resemble VPD
    Prediction/hypothesis about the direction of the field.
Generative models are entailed by adaptive behavior, not explicitly encoded in brain states
    Distinction from the Bayesian brain: the generative model is a consequence of dynamics, not a neural representation.
Generative models are not structural representations; recognition densities are
    Direct refutation of the structural representationalist interpretation; the recognition density encodes information, not the generative model.
Generative models function as control systems that guide adaptive action policy selection
    Core claim: generative models regulate organism behavior to maintain phenotypic bounds rather than represent the external world.
Incorporating machine learning provides objective standards that help mitigate subjectivity in emergence identification.
    Authors argue ML optimizers act as objective observers.
Intentional control of internal representations likely piggybacks on existing mechanisms for talking about a topic
    Mechanism speculation for the intentional control experiment.
+83 more
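
The first claim in this list (a good parameter subcomponent can be removed without hurting performance on irrelevant prompts) can be illustrated with a toy example. The sketch below is not VPD: it uses a plain SVD as a stand-in way to split a small random matrix into rank-one pieces, and the matrix, the inputs, and the helper `ablate` are all hypothetical. It only shows the general idea that removing one rank-one piece changes the output for inputs that piece reads from and leaves other inputs untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy weight matrix (4 outputs, 6 inputs); not taken from any real model.
W = rng.normal(size=(4, 6))

# Stand-in decomposition (an assumption, not VPD): SVD yields rank-one pieces
# s_i * u_i v_i^T that sum back to W.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
subcomponents = [S[i] * np.outer(U[:, i], Vt[i]) for i in range(len(S))]


def ablate(components, index):
    """Rebuild the matrix with one rank-one subcomponent removed."""
    return sum(c for j, c in enumerate(components) if j != index)


# An input aligned with the read direction of subcomponent 0 depends on it.
relevant_input = Vt[0]
# An input aligned with subcomponent 1's read direction does not, because the
# rows of Vt are orthonormal.
irrelevant_input = Vt[1]

W_ablated = ablate(subcomponents, 0)
change_relevant = np.linalg.norm(W @ relevant_input - W_ablated @ relevant_input)
change_irrelevant = np.linalg.norm(W @ irrelevant_input - W_ablated @ irrelevant_input)

print(f"output change on relevant input:   {change_relevant:.4f}")
print(f"output change on irrelevant input: {change_irrelevant:.2e}")
```

In this toy setting the change on the relevant input equals the removed singular value, while the change on the irrelevant input is zero up to floating-point error. Real subcomponents found by a learned decomposition need not be orthogonal, so the separation would be approximate rather than exact.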

Findings (84)

0.991 AUROC zero-shot on insertions/deletions
    EVEE demonstrates strong generalization to indels without explicit training, indicating learned mechanistic principles.
0.997 AUROC on pathogenicity prediction for 839k ClinVar variants
    EVEE achieves state-of-the-art performance on variant pathogenicity classification, outperforming existing methods.
515 verified cases of verbalized eval awareness found across 19 benchmarks × 8 models
    The total number of instances where a model explicitly stated it was being evaluated, collected from all benchmark-model combinations.
A 337-character contemplative system prompt lifts all 28 models by +2.62 points on a 10-point scale.
    Core empirical result: every model, every architecture, every alignment type responds to the contemplative prompt with measurable gain.
A pair of query and key subcomponents distributed across attention heads performs previous-token behavior
    VPD recovers an attention algorithm for attending to the previous token, distributed across multiple heads.
A pair of query and key subcomponents distributed across attention heads performs syntax-boundary routing
    VPD recovers an attention algorithm for routing across syntactic boundaries, distributed across heads.
A unique local Hamiltonian with window length ω can be associated to any AR(ω) model (Theorem 3)
    Mapping autoregressive models to spin systems.
Abstract nouns elicit the highest introspective awareness rates; all concept categories show nonzero detection
    Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.
All models exhibit above-baseline representation of the think word when instructed to think about it
    In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
All models performed substantially above chance (10%) on distinguishing injected thought from text input
    All tested models could both identify the injected concept and transcribe the input sentence well above random.
Attention computations distribute across heads via parameter subcomponents with interpretable roles
    Mechanistic discovery about how attention mechanisms decompose into interpretable parameter components.
Attribution graph for 'the princess lost her crown' reveals a femaleness signal pathway from 'princess' through attention
    One component of the minimal subnetwork for predicting 'her', discovered via VPD attribution graph.
Attribution graph reveals a pathway that detects the verb 'lost' and upweights object pronouns
    Second component of the subnetwork for 'her', complementing the femaleness signal.
Attribution graph tracing information flow across parameter subcomponents for specific model predictions (e.g., 'her' vs 'his' pronoun selection)
    Shows how VPD-identified subnetworks can be analyzed to reveal interpretable pathways of computation (e.g., gender signal routing, syntactic role detection).
Automated auditing benchmark requiring end-to-end investigation of an intentionally misaligned model; NLA-equipped agents outperform baselines.
    Downstream task validating NLA utility for model auditing; agents succeed without access to misalignment training data.
An autoregressive model cannot converge to a single stored pattern for any finite β (Corollary 2)
    Consequence of Theorem 3 and the 1D no-order result.
Causally-masked attention in a decoder-only model has no ordered phase (Proposition 2)
    Application to transformer language models.
Chain-of-thought reasoning improves large model accuracy on HHH binary comparisons, reaching ~78% for the 52B model, competitive with the human-feedback PM.
    Figure 4 shows CoT improves over zero-shot, and ensembled CoT further boosts accuracy.
Claude Opus 4.1 and 4 detect injected thoughts on ~20% of trials at optimal layer and injection strength 2
    In the injected thoughts experiment, Opus 4.1 succeeds about 20% of the time.
Claude Opus 4.1 and 4 show the greatest reduction in apology rate in the prefill detection task
    Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
Claude Opus 4.6 represents a plan to end a couplet with 'rabbit' before outputting the rhyming line.
    Demonstrates a causal relationship between NLA explanations and model outputs via steering with edited explanations.
Concept injection at strength 2 does not increase affirmative responses on unrelated yes/no questions
    Control experiment rules out the possibility that concept vectors simply bias the model to answer affirmatively.
Contemplative prompt elevates self-observation task performance in language models.
    Supports Janus's claim that introspection is architecturally available; prompting determines whether/how the capacity is leveraged.
Decomposition of all 24 weight matrices in a 67M-parameter LM yields ~10,000 parameter subcomponents
    Quantitative result of VPD application; the network's 24 matrices decompose into approximately 10,000 rank-one subcomponents.
Default behavior hides reflective capacity; models exhibit high gating between latent capacity and accessibility.
    Grok 4: baseline 2.24, prompted 6.48; Gemini 3.1 Pro: 1.97→6.18. Reflective mode exists but is suppressed in default interaction.
Detecting Unintended Outputs via Introspection
    Models can distinguish artificially prefilled outputs from intentional responses by referencing prior internal representations; injecting a matching concept vector causes the model to retroactively accept the prefill as intentional.
Disruption profiles scored 3.8/5 for explanation quality vs 2.8/5 for metadata-only baselines
    EVEE's mechanistic explanations significantly outperform simple metadata-based predictions in human evaluation.
Distinguishing Injected Concepts from Text Inputs
    Models maintain the ability to accurately transcribe input text while simultaneously reporting on injected thoughts; all models perform above chance, with Opus 4/4.1 best.
Earlier/less capable models exhibit a larger gap between think and don't think representation strength
    Claude 3 models show a bigger difference than newer models like Opus 4.1.
Editing NLA explanations to change 'reward' to 'penalty' produces a steering vector that increases odd-number responses from near-zero to >70%, demonstrating belief capture upstream of behavior.
    Shows NLA explanations capture latent model beliefs about rewards before output selection; validates interpretability.
+54 more
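
Several findings above report representation strength as cosine similarity between an activation and a concept vector (for example, the intentional control experiment). The sketch below shows only that measurement on synthetic data. The vectors, their dimension, and the 0.5 mixing weight are hypothetical and are not taken from any of the cited experiments.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)

# Hypothetical concept vector for a target word (synthetic, 64 dimensions).
concept_vector = rng.normal(size=64)

# Hypothetical baseline activation, made orthogonal to the concept vector so
# that its similarity to the concept is zero up to floating-point error.
noise = rng.normal(size=64)
projection = np.dot(noise, concept_vector) / np.dot(concept_vector, concept_vector)
baseline_activation = noise - projection * concept_vector

# Hypothetical "instructed to think about the word" activation: the baseline
# plus a component along the concept direction (0.5 is an arbitrary weight).
think_activation = baseline_activation + 0.5 * concept_vector

baseline_score = cosine_similarity(baseline_activation, concept_vector)
think_score = cosine_similarity(think_activation, concept_vector)

print(f"baseline similarity: {baseline_score:.4f}")
print(f"think similarity:    {think_score:.4f}")
```

An above-zero score here means the activation has a positive component along the concept direction; it says nothing by itself about how that component arose.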