claim

active

claim:das-s-access-to-model-outputs-during-training-is-responsible-for-much-of-its-advantage-over-other-interpretability-methods

DAS's access to model outputs during training is responsible for much of its advantage over other interpretability methods

Authors' interpretation of the selectivity results, which show that DAS's advantage diminishes when controlling for expressivity

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Papers (1)

paper

CausalGym: Benchmarking causal interpretability methods on linguistic tasks
introduces

Findings (3)

finding

DAS achieves substantial causal effect even on arbitrary input-output mappings where no causal mechanism should exist
supports
Replication of Wu et al. 2023 finding; DAS expressivity concern validated in CausalGym setup
Probe achieves selectivity of 4.20 on pythia-410m, slightly exceeding DAS selectivity of 3.96
supports
Key result showing that for models larger than pythia-70m, probe selectivity matches or exceeds DAS selectivity
DAS consistently finds the most causally-efficacious features across all pythia model sizes in CausalGym
supports
Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random

Questions (1)

question

How much of the causal effect found by DAS is due to its expressivity rather than any aspect of the representation being studied?
gates
Core methodological question motivating the introduction of selectivity and control tasks

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DAS finds causal effect at all training timesteps including when model is just initialised · finding · 0.814
Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates
The field of interpretability has focused mainly on understanding model activations, not the computations themselves · claim · 0.793
Motivation for VPD's parameter-focused approach.
Trainable intervention (DAS) finds sparser gender representations than linear probing, suggesting probing overestimates causal coverage · claim · 0.776
Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions
When a model discovers that its outputs produce effects, it accelerates learning through in-context learning, analogous to lucid dreaming · claim · 0.774
Describes scaffolding method and the model's meta-learning loop.
Current training methods rely on loss minimization, meaning the experiential profile of training is predominantly negative across billions of parameter updates · claim · 0.774
Ethical implication about the nature of AI training experience if the thesis holds
Activation-based interpretability does not immediately explain the computations that gave rise to activations; understanding parameters is necessary for deeper insight · claim · 0.770
Motivates shift from studying model activations ('thoughts') to understanding parameters ('the computations themselves').
Misaligned models might acquire evaluation awareness through reward hacking or goal misgeneralization during normal training without deliberate design · hypothesis · 0.767
Motivation for the two-stage training design; links the model organism to plausible natural emergence.
Training identical architectures on the same data with different objective functions should produce systematically different internal evaluative representations, detectable through interpretability tools, even when final task performance is matched · hypothesis · 0.765
Second falsifiable prediction linking objective function structure to valence profile