claim
active
claim:probe-based-method-bridges-interpretability-probes-activations-with-data-centric-alignment-work
Probe-based method bridges interpretability (probes/activations) with data-centric alignment work
Assertion from the paper's notes that the work connects two previously separate areas: interpretability tools and data-centric alignment.
Source paper
extracted_from: (2026) · Frank Xiao · Santiago Aranguri
Neighborhood — ranked by edge-count
Communities (4)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Tracing information flow through weight matrices and attention heads using attribution graphs to identify causally important subcomponents in language models.
- Probe-based training data attribution (members_of): Uses linear probes on activations to identify and filter harmful training data cheaply (~$30).
- Uses probes, activation patching, and mechanistic analysis to ground abstract concepts in model computations, bridging interpretability with data-centric alignment.
Methods (1)
method
- Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
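A minimal sketch of the probe-then-filter idea described above, using synthetic activations only. It is not the source paper's implementation: the activation width, the data generator, the plain gradient-descent logistic probe, and the names `make_activation`, `train_linear_probe`, and `flag_datapoints` are all assumptions made for illustration. The sketch fits a linear classifier on activations labeled for an undesired behavior, then scores training datapoints with that probe and flags the highest-scoring ones for removal.

```python
import math
import random

random.seed(0)
DIM = 8  # toy activation width (assumption, not from the source)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_activation(has_behavior):
    """Synthetic activation: coordinate 0 carries the behavior signal (assumption)."""
    x = [random.gauss(0.0, 0.3) for _ in range(DIM)]
    x[0] += 2.0 if has_behavior else -2.0
    return x


def train_linear_probe(acts, labels, lr=0.5, epochs=200):
    """Fit logistic-regression weights and bias by batch gradient descent."""
    w, b, n = [0.0] * DIM, 0.0, len(acts)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * DIM, 0.0
        for x, y in zip(acts, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(DIM):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def flag_datapoints(w, b, train_acts, top_k):
    """Return indices of the top_k training points with the highest probe score."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in train_acts]
    ranked = sorted(range(len(train_acts)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Labeled probe set: activations with (1) and without (0) the undesired behavior.
probe_labels = [1] * 20 + [0] * 20
probe_acts = [make_activation(bool(y)) for y in probe_labels]
w, b = train_linear_probe(probe_acts, probe_labels)

# Toy training set: by construction, indices 0-4 carry the behavior signal.
train_acts = [make_activation(i < 5) for i in range(20)]
flagged = flag_datapoints(w, b, train_acts, top_k=5)
kept = [i for i in range(len(train_acts)) if i not in flagged]

print("flagged for removal:", sorted(flagged))
print("kept:", kept)
```

In practice the activations would come from a real model layer and the flagged set would be reviewed or removed before retraining; the fixed `top_k` cutoff here is a simplification, and a score threshold would work equally well.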
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Conceptual framing: integrates mechanistic interpretability tools with alignment-focused data curation.
- Shows the key divide is passive vs. active framing, not the specific wording of instructions.
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance.
- Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
- Question raised by the discrepancy between DAS IIA and linear probe accuracy in Case Study II.
- Shows the passive vs. active divide is more important than the specific wording of instructions.
- Motivates shift from studying model activations ('thoughts') to understanding parameters ('the computations themselves').