claim
active
claim:das-s-access-to-model-outputs-during-training-is-responsible-for-much-of-its-advantage-over-other-interpretability-methods

DAS's access to model outputs during training is responsible for much of its advantage over other interpretability methods

The authors' interpretation of the selectivity results, which show that DAS's advantage diminishes when controlling for expressivity.

Source paper

extracted_from
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Findings (3)

finding

Questions (1)

question

Related by similarity (8)
