community
active
leiden_hybrid_concepts
label: haiku
community:leiden_hybrid_concepts-run4-c0-c1
Mechanistic interpretability via parameter decomposition
Tracing information flow through weight matrices and attention heads using attribution graphs to identify causally important subcomponents in language models.
24 members.
Sub-communities (7)
Finer clusters this community splits into. Each is its own community page.
- Neural network mechanistic interpretability via attribution decomposition (5)
- Activation-based interpretability and mechanistic grounding (4)
- Causal parameter decomposition in neural networks (4)
- Vector Product Decomposition for neural interpretability (4)
- Mechanistic interpretability through parameter analysis (3)
- Convergent interpretability across neural architectures (2)
- Spectral decoding of neural interventions (2)
Drawn from 7 sources
The papers and notes whose extracted claims and findings make up this cluster.
- Paper Summary: Interpreting Language Model Parameters (15 members)
- Interpreting Language Model Parameters (2 members)
- Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders (2 members)
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training (2 members)
- 2026-05-09_briefing_for_ozero.md (1 member)
- Explaining 4.2 million genetic variants with state-of-the-art, interpretable predictions (1 member)
- RESEARCH-VECTORS.md (1 member)
Bridges (14)
Other communities that share members with this one: cross-cutting threads or papers that sit at the seam between two themes.
- Mechanistic interpretability & model evaluation (24 shared)
- AI phenomenology & mechanistic interpretability (5 shared)
- Neural network mechanistic interpretability via attribution decomposition (5 shared)
- Virtually Planned Decomposition interpretability (4 shared)
- Activation-based interpretability and mechanistic grounding (4 shared)
- Causal parameter decomposition in neural networks (4 shared)
- Vector Product Decomposition for neural interpretability (4 shared)
- Mechanistic interpretability through parameter analysis (3 shared)
- Attribution graphs for transformer circuits (3 shared)
- Probe-based training data attribution (2 shared)
- Convergent interpretability across neural architectures (2 shared)
- Spectral decoding of neural interventions (2 shared)
- Causal parameter subcomponent isolation (2 shared)
- Distributed attention head decomposition (1 shared)
Claims (16)
- A good parameter subcomponent is causally important only for specific roles and can be removed from the model without hurting performance on irrelevant prompts.
  Definitional principle guiding VPD: subcomponents should encode narrow, targeted computational roles rather than distributed, multi-purpose functionality.
- Activation-based interpretability does not immediately explain the computations that gave rise to activations; understanding parameters is necessary for deeper insight.
  Motivates the shift from studying model activations ('thoughts') to understanding parameters ('the computations themselves').
- Bottom-up interpretability explains computation in the model's own terms rather than imposing top-down abstractions.
  VPD is positioned as advancing a paradigm shift from top-down, activation-based mechanistic interpretability to parameter-centric, data-driven discovery.
- Future interpretability techniques will fundamentally resemble VPD.
  Prediction/hypothesis about the direction of the field.
- Interpretable predictions can help resolve variants of uncertain significance (VUS).
  Motivating claim that mechanistic explanations add clinical value for VUS.
- Parameter subcomponents cleanly isolate true mechanisms of the model.
  Interpretive claim that the subcomponents correspond to real functional units.
- Probe-based data attribution bridges interpretability work (probes/activations) with data-centric alignment.
  Conceptual framing: integrates mechanistic interpretability tools with alignment-focused data curation.
- Probe-based method bridges interpretability (probes/activations) with data-centric alignment work.
  Assertion from the paper's notes that the work connects two previously separate areas: interpretability tools and data-centric alignment.
- Spectral decoding of concept interventions can provide physiologically interpretable frequency signatures.
  Claim that the spectral decoder adds physiological interpretability.
- The field of interpretability has focused mainly on understanding model activations, not the computations themselves.
  Motivation for VPD's parameter-focused approach.
- VPD can be applied to arbitrary neural network architectures.
  Claim of generality, highlighted as a key strength.
- VPD identifies real computational structure in neural network parameters.
  Central claim that VPD successfully uncovers genuine mechanisms.
- VPD is a meaningful step toward bottom-up interpretability.
  Positioning of VPD as advancing the paradigm of explaining computation in the model's own terms.
- VPD subcomponents avoid feature splitting, improving interpretability over the SAE approach.
  Core interpretive claim that VPD's parameter-based decomposition prevents the feature fragmentation seen in activation-based methods.
- Interpretability as technical grounding: activation patching and mechanism-finding validate the reflective/care/aliveness concepts.
- Interpretability features converge across different model architectures, revealing structural similarities.
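The first claim above (a good subcomponent matters only for specific roles and can be removed without hurting unrelated prompts) can be illustrated with a toy ablation check. This is a minimal sketch under stated assumptions, not the VPD procedure: the rank-one subcomponents are built by hand rather than learned, the "prompts" are plain input vectors, and the helper names (`build_weight`, `ablation_effect`) are hypothetical.

```python
# Toy illustration only: hand-built rank-one subcomponents, not the output of VPD.

def outer(u, v):
    """Rank-one matrix u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matvec(m, x):
    """Matrix-vector product."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

def build_weight(components, skip=None):
    """Sum the rank-one subcomponents, optionally leaving one out (ablation)."""
    rows, cols = len(components[0][0]), len(components[0][1])
    w = [[0.0] * cols for _ in range(rows)]
    for i, (u, v) in enumerate(components):
        if i != skip:
            w = add(w, outer(u, v))
    return w

def ablation_effect(components, skip, x):
    """Largest absolute change in the output when subcomponent `skip` is removed."""
    full = matvec(build_weight(components), x)
    ablated = matvec(build_weight(components, skip=skip), x)
    return max(abs(f - a) for f, a in zip(full, ablated))

# Each subcomponent is (u, v): it reads the input along v and writes along u.
components = [
    ([1.0, 0.0], [1.0, 0.0, 0.0]),  # reads input dimension 0
    ([0.0, 2.0], [0.0, 1.0, 0.0]),  # reads input dimension 1
]

relevant_prompt = [3.0, 0.0, 0.0]    # activates subcomponent 0
irrelevant_prompt = [0.0, 5.0, 0.0]  # does not activate subcomponent 0

effect_relevant = ablation_effect(components, 0, relevant_prompt)
effect_irrelevant = ablation_effect(components, 0, irrelevant_prompt)
print(effect_relevant, effect_irrelevant)  # prints: 3.0 0.0
```

In this toy setting, removing subcomponent 0 changes the output only for the input it reads from, which is the behaviour the definitional claim asks for. In a real model the comparison would presumably be made on task loss over sets of prompts rather than on a single output vector.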
Findings (8)
- Attention computations distribute across heads via parameter subcomponents with interpretable roles.
  Mechanistic discovery about how attention mechanisms decompose into interpretable parameter components.
- Attribution graph for 'the princess lost her crown' reveals a femaleness signal pathway from 'princess' through attention.
  One component of the minimal subnetwork for predicting 'her', discovered via a VPD attribution graph.
- Attribution graph reveals a pathway that detects the verb 'lost' and upweights object pronouns.
  Second component of the subnetwork for 'her', complementing the femaleness signal.
- Attribution graphs trace information flow across parameter subcomponents for specific model predictions (e.g., 'her' vs 'his' pronoun selection).
  Shows how VPD-identified subnetworks can be analyzed to reveal interpretable pathways of computation (e.g., gender signal routing, syntactic role detection).
- Decomposition of all 24 weight matrices in a 67M-parameter LM yields ~10,000 parameter subcomponents.
  Quantitative result of VPD application; the network's 24 matrices decompose into approximately 10,000 rank-one subcomponents.
- Spectral decoder maps concept interventions to pathological slow-wave suppression and alpha-band restoration.
  Physiological interpretability result linking latent steering to EEG frequency signatures.
- Subnetwork for predicting 'her' vs 'his' in 'the princess lost her crown' involves femaleness signal routing via attention and syntactic role detection.
  Detailed case study demonstrating how VPD subnetworks can be traced to reveal multiple interpretable computational pathways for a single prediction.
- The VPD-based edit has off-target effects as low as those of uninterpretable fine-tuning methods.
  Performance comparison showing subcomponent editing is comparable to fine-tuning in preserving off-target behavior.
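The findings above describe weight matrices decomposed into rank-one subcomponents and attribution graphs that trace which subcomponents drive a prediction. The sketch below shows, for a single toy linear layer, why a rank-one decomposition makes such accounting possible: if W is a sum of terms u_i v_i^T, each term contributes (v_i · x) * (readout · u_i) to a scalar readout of W x, and these contributions sum exactly to the dense result. This is an illustrative stand-in and not the attribution method from the source; all vectors and the `contributions` helper are hypothetical.

```python
# Toy linear layer; an illustrative stand-in, not the attribution method from the source.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def contributions(components, x, readout):
    """Per-subcomponent contribution to the scalar readout . (W x),
    where W = sum_i u_i v_i^T, so contribution_i = (v_i . x) * (readout . u_i)."""
    return [dot(v, x) * dot(readout, u) for u, v in components]

# Hypothetical subcomponents (u, v): read the input along v, write along u.
components = [
    ([1.0, 0.0], [2.0, 0.0]),
    ([0.0, 1.0], [0.0, 3.0]),
    ([1.0, 1.0], [1.0, 1.0]),
]
x = [1.0, 2.0]          # toy input activation
readout = [1.0, -1.0]   # toy direction, e.g. a logit difference

per_component = contributions(components, x, readout)

# Dense check: build W explicitly and compute the same readout.
W = [
    [sum(u[r] * v[c] for u, v in components) for c in range(len(x))]
    for r in range(len(readout))
]
dense = dot(readout, [dot(row, x) for row in W])

print(per_component, dense)  # prints: [2.0, -6.0, 0.0] -4.0
```

Here the third subcomponent contributes nothing to this readout even though it is active on the input, because what it writes is orthogonal to the readout direction. In a multi-layer network with nonlinearities the contributions would no longer add up this simply, so this should be read only as intuition for the linear case.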