thinker
active
thinker:dan-jurafsky

Dan Jurafsky

Authored: 1
Introduces: 0
Studies: 0
Affiliations: 1
Cited by: 3

Authored papers (1)

  • CausalGym is a benchmark that adapts and expands SyntaxGym's 33 test suites into 29 tasks. It establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-means, LDA, PCA, k-means, and random baselines at causally influencing language model behavior via 1D distributed interchange intervention. The evaluation covers the Pythia model family (14M–6.9B parameters), with 400 training and 100 evaluation examples per task. On pythia-1b, DAS achieves an average log odds-ratio of 10.74, compared to 3.66 for probing and 3.17 for difference-in-means. However, when selectivity is computed by subtracting performance on control tasks that require arbitrary token mappings (an adaptation of Hewitt and Liang's (2019) probing control paradigm), the gap between DAS and probing narrows substantially, revealing that DAS's advantage partly reflects its expressivity rather than genuine causal alignment. Applying DAS across training checkpoints of pythia-1b on NPI licensing (npi_any_subj-relc) and filler-gap dependencies (filler_gap_subj) shows that the causal mechanism for both phenomena emerges in discrete stages rather than gradually: information traverses multiple intermediate token positions before reaching the output, and both mechanisms appear fully only after step 2000–3000 of training. The paper argues that interpretability evaluation requires causal interventional paradigms rather than behavioral or representational proxies alone, and that psycholinguistic LM research should move beyond surprisal comparisons toward mechanistic analysis.
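
    To make the quantities in the summary above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation. The function names are hypothetical. The intervention is written as replacing the base representation's component along a single unit direction with the source's component, and the log odds-ratio is written in one plausible form (how far the intervention moves the model from the base label toward the source label). Both forms are assumptions here and should be checked against the paper's definitions. The numbers in the usage lines are toy values, not results.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def interchange_1d(base, source, direction):
    """Assumed form of a 1D distributed interchange intervention.

    Swaps the component of `base` along `direction` for the component
    of `source` along it, leaving the orthogonal part of `base` as is.
    """
    norm = math.sqrt(dot(direction, direction))
    a = [x / norm for x in direction]           # unit-length direction
    shift = dot(source, a) - dot(base, a)       # difference in projections
    return [b + shift * ai for b, ai in zip(base, a)]


def log_odds_ratio(p_base, p_int, y_base, y_src):
    """Assumed form of the metric: log-odds of the base label before the
    intervention plus log-odds of the source label after it."""
    before = math.log(p_base[y_base]) - math.log(p_base[y_src])
    after = math.log(p_int[y_src]) - math.log(p_int[y_base])
    return before + after


def selectivity(task_score, control_score):
    """Task performance minus performance on the control task."""
    return task_score - control_score


if __name__ == "__main__":
    # Toy vectors: only the first coordinate lies along the direction.
    print(interchange_1d([1.0, 2.0, 3.0], [4.0, 0.0, 3.0], [2.0, 0.0, 0.0]))
    # Toy probabilities over two candidate next tokens (made-up values).
    print(log_odds_ratio([0.8, 0.2], [0.2, 0.8], y_base=0, y_src=1))
    # Toy scores (made-up values).
    print(selectivity(5.0, 2.0))
```

    In this sketch a method with a high raw score but an equally high control score gets a selectivity near zero, which is the sense in which the summary says an expressive method's advantage can shrink once controls are subtracted.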

Affiliations (1)

Co-authors (2)

Recent mentions (1)