Mechanistic interpretability via parameter decomposition

Tracing information flow through weight matrices and attention heads using attribution graphs to identify causally important subcomponents in language models.

24 members.

Sub-communities (7)

Finer clusters this community splits into. Each is its own community page.

Neural network mechanistic interpretability via attribution decomposition (5)
Activation-based interpretability and mechanistic grounding (4)
Causal parameter decomposition in neural networks (4)
Vector Product Decomposition for neural interpretability (4)
Mechanistic interpretability through parameter analysis (3)
Convergent interpretability across neural architectures (2)
Spectral decoding of neural interventions (2)

Drawn from 7 sources

The papers/notes whose extracted claims & findings make up this cluster.

Paper Summary: Interpreting Language Model Parameters (15 members)
Interpreting Language Model Parameters (2 members)
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders (2 members)
Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training (2 members)
2026-05-09_briefing_for_ozero.md (1 member)
Explaining 4.2 million genetic variants with state-of-the-art, interpretable predictions (1 member)
RESEARCH-VECTORS.md (1 member)

Bridges (14)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

Mechanistic interpretability & model evaluation (24 shared)
AI phenomenology & mechanistic interpretability (5 shared)
Neural network mechanistic interpretability via attribution decomposition (5 shared)
Virtually Planned Decomposition interpretability (4 shared)
Activation-based interpretability and mechanistic grounding (4 shared)
Causal parameter decomposition in neural networks (4 shared)
Vector Product Decomposition for neural interpretability (4 shared)
Mechanistic interpretability through parameter analysis (3 shared)
Attribution graphs for transformer circuits (3 shared)
Probe-based training data attribution (2 shared)
Convergent interpretability across neural architectures (2 shared)
Spectral decoding of neural interventions (2 shared)
Causal parameter subcomponent isolation (2 shared)
Distributed attention head decomposition (1 shared)

Claims (16)

A good parameter subcomponent is causally important only for specific roles and can be removed from the model without hurting performance on irrelevant prompts. Definitional principle guiding VPD: subcomponents should encode narrow, targeted computational roles rather than distributed, multi-purpose functionality.
Activation-based interpretability does not immediately explain the computations that gave rise to activations; understanding parameters is necessary for deeper insight. Motivates shift from studying model activations ('thoughts') to understanding parameters ('the computations themselves').
Bottom-up interpretability explains computation in the model's own terms rather than imposing top-down abstractions. VPD is positioned as advancing a paradigm shift from top-down mechanistic interpretability (activation-based) to parameter-centric, data-driven discovery.
Future interpretability techniques will fundamentally resemble VPD. Prediction/hypothesis about the direction of the field.
Interpretable predictions can help resolve variants of uncertain significance. Motivating claim that mechanistic explanations add clinical value for VUS.
Parameter subcomponents cleanly isolate true mechanisms of the model. Interpretive claim that the subcomponents correspond to real functional units.
Probe-based data attribution bridges interpretability work (probes/activations) with data-centric alignment. Conceptual framing: integrates mechanistic interpretability tools with alignment-focused data curation.
Probe-based method bridges interpretability (probes/activations) with data-centric alignment work. Assertion from the paper's notes that the work connects two previously separate areas: interpretability tools and data-centric alignment.
Spectral decoding of concept interventions can provide physiologically interpretable frequency signatures. Claim that the spectral decoder adds physiological interpretability.
The field of interpretability has focused mainly on understanding model activations, not the computations themselves. Motivation for VPD's parameter-focused approach.
VPD can be applied to arbitrary neural network architectures. Claim of generality, highlighted as a key strength.
VPD identifies real, computational structure in neural network parameters. Central claim that VPD successfully uncovers genuine mechanisms.
VPD is a meaningful step toward bottom-up interpretability. Positioning of VPD as advancing the paradigm of explaining computation in the model's terms.
VPD subcomponents avoid feature splitting, improving interpretability over the SAE approach. Core interpretive claim that VPD's parameter-based decomposition prevents the feature fragmentation seen in activation-based methods.
Interpretability as technical grounding: activation patching and mechanism-finding validate the reflective/care/aliveness concepts.
Interpretability features converge across different model architectures, revealing structural similarities.
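
The first claim in this list (a good subcomponent can be removed without hurting performance on irrelevant prompts) can be pictured with a toy ablation. The sketch below is not the VPD algorithm and uses no trained model: the two rank-one components, their read and write directions, and the two inputs are all hypothetical values chosen by hand so that the effect is visible.

```python
# Toy illustration only: hand-built rank-one components, not learned by VPD.

def outer(u, v):
    """Rank-one matrix u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def assemble(parts, rows=3, cols=2):
    """Sum a list of equally shaped matrices into one weight matrix."""
    total = [[0.0] * cols for _ in range(rows)]
    for part in parts:
        total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, part)]
    return total

def matvec(w, x):
    """Matrix-vector product w @ x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical subcomponents: component 0 reads input direction e0,
# component 1 reads input direction e1.
components = [
    outer([1.0, 0.0, 2.0], [1.0, 0.0]),
    outer([0.0, 3.0, 1.0], [0.0, 1.0]),
]

full = assemble(components)
ablated = assemble(components[1:])  # component 0 removed

relevant_input = [1.0, 0.0]    # lies along component 0's read direction
irrelevant_input = [0.0, 1.0]  # orthogonal to component 0's read direction

relevant_change = max_abs_diff(matvec(full, relevant_input),
                               matvec(ablated, relevant_input))
irrelevant_change = max_abs_diff(matvec(full, irrelevant_input),
                                 matvec(ablated, irrelevant_input))

print(relevant_change, irrelevant_change)
```

In this hand-built case the ablation changes the output only for the input that the removed component reads from. Whether subcomponents of a real model behave this cleanly is exactly what the claims above assert, and the toy does not test it.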

Findings (8)

Attention computations distribute across heads via parameter subcomponents with interpretable roles. Mechanistic discovery about how attention mechanisms decompose into interpretable parameter components.
Attribution graph for 'the princess lost her crown' reveals a femaleness signal pathway from 'princess' through attention. One component of the minimal subnetwork for predicting 'her', discovered via VPD attribution graph.
Attribution graph reveals a pathway that detects the verb 'lost' and upweights object pronouns. Second component of the subnetwork for 'her', complementing the femaleness signal.
Attribution graph tracing information flow across parameter subcomponents for specific model predictions (e.g., 'her' vs 'his' pronoun selection). Shows how VPD-identified subnetworks can be analyzed to reveal interpretable pathways of computation (e.g., gender signal routing, syntactic role detection).
Decomposition of all 24 weight matrices in a 67M-parameter LM yields ~10,000 parameter subcomponents. Quantitative result of VPD application; the network's 24 matrices decompose into approximately 10,000 rank-one subcomponents.
Spectral decoder maps concept interventions to pathological slow-wave suppression and alpha-band restoration. Physiological interpretability result linking latent steering to EEG frequency signatures.
Subnetwork for predicting 'her' vs 'his' in 'the princess lost her crown' involves femaleness signal routing via attention and syntactic role detection. Detailed case study demonstrating how VPD subnetworks can be traced to reveal multiple interpretable computational pathways for a single prediction.
The VPD-based edit has off-target effects as low as those of uninterpretable fine-tuning methods. Performance comparison showing that subcomponent editing is comparable to fine-tuning in preserving off-target behavior.
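
To make the decomposition finding above more concrete, the following sketch does two small things. It works out the average number of subcomponents per matrix implied by the reported figures (24 matrices, roughly 10,000 subcomponents), and it shows what "a sum of rank-one subcomponents" means by rebuilding a tiny matrix from rank-one terms. The example matrix and the row-by-row decomposition are assumptions made purely for illustration; this trivial decomposition is not the one VPD learns.

```python
# Figures from the finding above; everything else here is illustrative.
NUM_MATRICES = 24
APPROX_SUBCOMPONENTS = 10_000
avg_per_matrix = APPROX_SUBCOMPONENTS / NUM_MATRICES

def outer(u, v):
    """Rank-one matrix u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def rank_one_sum(us, vs):
    """Sum of the rank-one matrices outer(us[k], vs[k])."""
    rows, cols = len(us[0]), len(vs[0])
    total = [[0.0] * cols for _ in range(rows)]
    for u, v in zip(us, vs):
        term = outer(u, v)
        total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, term)]
    return total

# Hypothetical 3x2 weight matrix.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Trivial decomposition (an assumption for illustration, NOT what VPD finds):
# one rank-one term per row, pairing a one-hot vector with that row.
us = [[1.0 if i == k else 0.0 for i in range(len(W))] for k in range(len(W))]
vs = [row[:] for row in W]

reconstructed = rank_one_sum(us, vs)

print(round(avg_per_matrix))
print(reconstructed == W)
```

Any matrix admits many such decompositions. The findings listed here concern a particular one whose terms are reported to be causally meaningful, which this sketch does not attempt to reproduce.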