finding

active

finding:npi-mechanism-in-pythia-1b-moves-negation-feature-through-complementiser-that-auxiliary-verb-and-main-verb-across-layers-before-predicting-npi-any

NPI mechanism in pythia-1b moves negation feature through complementiser 'that', auxiliary verb, and main verb across layers before predicting NPI 'any'

Mechanistic finding from CausalGym case study showing multi-step information movement in NPI mechanism

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Claims (1)

claim

For both the NPI and filler-gap tasks, the model initially learns to move information directly from the alternating token to the output position; intermediate steps are added later in training
supports
Mechanistic interpretation of training dynamics in case studies

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

NPI licensing mechanism in pythia-1b emerges in discrete stages (steps 1000, 2000, 3000) not gradually · finding · 0.828
Training dynamics finding showing abrupt rather than gradual emergence of NPI mechanism
Filler-gap dependency mechanism in pythia-1b emerges in two discrete stages (steps 2000 and 10K) not gradually · finding · 0.769
Training dynamics finding showing filler-gap takes longer to learn than NPI licensing
Filler-gap mechanism in pythia-1b crosses over several different positions before arriving at output position · finding · 0.762
Mechanistic finding from CausalGym case study showing complex multi-step movement for filler-gap
Across 5 Pythia seeds, one seed fails to learn IOI task and another fails alignment despite learning the task; all other seeds achieve perfect alignment with ϕ_nonlin · finding · 0.756
Robustness check across seeds showing occasional failures of alignment map training
The mechanisms implementing NPI licensing and filler-gap dependencies are learned in discrete stages, not gradually · claim · 0.756
Main mechanistic finding from case studies; evidence from training checkpoint analysis of pythia-1b
Results may not fully generalize to all models and scenarios because the model organism relies on hints and nudges and Llama Nemotron cannot consistently distinguish evaluation/deployment based on subtle cues · claim · 0.753
Key limitation acknowledged by authors.
DAS consistently finds the most causally-efficacious features across all pythia model sizes in CausalGym · finding · 0.748
Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random
What are the neuronal mechanisms by which prior beliefs from one agent's model are received and properly implemented by a naive agent (neuronal hermeneutics)? · question · 0.746
Open question about inter-agent communication beyond model-space assumption
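
The "cosine ≥ 0.65 · no typed edge" filter behind the similarity list above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the entity ids, the toy 2-d embeddings, and the edge representation are all hypothetical, not the system that actually produced this page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Entities whose cosine similarity to the target is >= threshold and
    that share no typed edge with it, sorted by descending score."""
    # Collect every entity already linked to the target, in either direction.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}
    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(embeddings[target_id], vector)
        if score >= threshold:
            hits.append((entity_id, round(score, 3)))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: ids and vectors are made up for illustration.
embeddings = {
    "npi-mechanism": [1.0, 0.0],
    "npi-stages": [0.8, 0.6],
    "unrelated": [0.0, 1.0],
    "source-claim": [1.0, 0.1],
}
typed_edges = [("npi-mechanism", "source-claim")]  # e.g. a "supports" edge

result = similar_untyped("npi-mechanism", embeddings, typed_edges)
print(result)
```

In this toy setup, "source-claim" is excluded despite its high similarity because it already has a typed edge, which is the property that makes the remaining hits candidates for new edges or unrecognized duplicates.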