community
active
leiden_hybrid_concepts
label: haiku
community:leiden_hybrid_concepts-run4-c0-c1-c1
Activation-based interpretability and mechanistic grounding
Uses probes, activation patching, and mechanistic analysis to ground abstract concepts in model computations, bridging interpretability with data-centric alignment.
4 members.
Drawn from 3 sources
The papers/notes whose extracted claims & findings make up this cluster.
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training (2 members)
- 2026-05-09_briefing_for_ozero.md (1 member)
- Paper Summary: Interpreting Language Model Parameters (1 member)
Bridges (4)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
Claims (4)
- Probe-based data attribution bridges interpretability work (probes/activations) with data-centric alignment. Conceptual framing: integrates mechanistic interpretability tools with alignment-focused data curation.
- Probe-based method bridges interpretability (probes/activations) with data-centric alignment work. Assertion from the paper's notes that the work connects two previously separate areas: interpretability tools and data-centric alignment.
- The field of interpretability has focused mainly on understanding model activations, not the computations themselves. Motivation for VPD's parameter-focused approach.
- Interpretability as technical grounding: activation patching and mechanism-finding validate the reflective/care/aliveness concepts.