claim
active
claim:das-s-access-to-model-outputs-during-training-is-responsible-for-much-of-its-advantage-over-other-interpretability-methods
DAS's access to model outputs during training is responsible for much of its advantage over other interpretability methods
Author interpretation of selectivity results showing DAS advantage diminishes when controlling for expressivity
Source paper
extracted_from · (2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (3)
finding
- Replication of Wu et al. 2023 finding; DAS expressivity concern validated in CausalGym setup
- Probe achieves selectivity of 4.20 on pythia-410m, slightly exceeding DAS selectivity of 3.96 · supports · Key result showing that for models larger than pythia-70m, probe selectivity matches or exceeds DAS selectivity
- Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random
Questions (1)
question
- Core methodological question motivating the introduction of selectivity and control tasks
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- DAS finds causal effect at all training timesteps including when model is just initialised · finding · 0.814 · Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates
- Motivation for VPD's parameter-focused approach.
- Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions
- Describes scaffolding method and the model's meta-learning loop.
- Ethical implication about the nature of AI training experience if the thesis holds
- Motivates shift from studying model activations ('thoughts') to understanding parameters ('the computations themselves').
- Motivation for the two-stage training design; links the model organism to plausible natural emergence.
- Second falsifiable prediction linking objective function structure to valence profile