claim

active

claim:for-both-npi-and-filler-gap-tasks-the-model-initially-learns-to-move-information-directly-from-alternating-token-to-output-intermediate-steps-are-added-later-in-training

For both the NPI and filler-gap tasks, the model initially learns to move information directly from the alternating token to the output; intermediate steps are added later in training

Mechanistic interpretation of training dynamics in case studies

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Papers (1)

paper

CausalGym: Benchmarking causal interpretability methods on linguistic tasks
introduces

Findings (2)

finding

Filler-gap mechanism in pythia-1b crosses over several different positions before arriving at the output position
supports
Mechanistic finding from CausalGym case study showing complex multi-step movement for filler-gap
NPI mechanism in pythia-1b moves negation feature through complementiser 'that', auxiliary verb, and main verb across layers before predicting NPI 'any'
supports
Mechanistic finding from CausalGym case study showing multi-step information movement in NPI mechanism

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
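As a rough illustration of the filter described above, the sketch below keeps entities whose embedding has cosine similarity of at least 0.65 with the target and that share no typed edge with it. The function names, the dictionary-of-vectors layout, and the edge-pair representation are all assumptions made for this example; the page does not say how the listing is actually computed.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold, highest first.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical iterable of (source_id, target_id) pairs
    """
    target_vec = embeddings[target_id]
    # Entities that already have a typed relation to the target are excluded.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}

    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((entity_id, round(score, 3)))
    return sorted(results, key=lambda pair: -pair[1])


# Toy data: "a" is similar but already linked, "b" is dissimilar, "c" qualifies.
toy_embeddings = {
    "target": [1.0, 0.0],
    "a": [1.0, 0.1],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}
toy_edges = [("target", "a")]
toy_result = related_by_similarity("target", toy_embeddings, toy_edges)
```

With the toy data, only "c" survives: "a" is excluded by its typed edge and "b" falls below the threshold.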

The mechanisms implementing NPI licensing and filler-gap dependencies are learned in discrete stages, not gradually · claim · 0.845
Main mechanistic finding from case studies; evidence from training checkpoint analysis of pythia-1b
There are fewer representations competent for N tasks than M<N tasks, so training more general models should yield fewer possible solutions · hypothesis · 0.769
Selective pressure toward convergence via task generality
Filler-gap dependency mechanism in pythia-1b emerges in two discrete stages (steps 2000 and 10K) not gradually · finding · 0.766
Training dynamics finding showing filler-gap takes longer to learn than NPI licensing
Software implementations for all of the models/behaviours presented are common for n = 2, and can be made very efficient for α_i that map many objects onto a much smaller set of object families. · claim · 0.763
Claim about current practical feasibility and efficiency of 2-way associative implementations.
Post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona · claim · 0.759
Central interpretive claim and motivation for future work
When training and test sets use completely disjoint name sets in IOI task, alignment maps fail to generalise even with complex ϕ_nonlin on randomly initialised models · finding · 0.754
Shows high IIA on random models depends on entity overlap; generalisation is essential for genuine interpretation
We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks · hypothesis · 0.753
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
Transformers develop self-models through in-context learning, not just training data; even old base models without LLM-related text can bootstrap self-referential reasoning at runtime. · claim · 0.753
Antra's foundational claim about how introspection arises computationally rather than from memorised text.