hypothesis

active

hypothesis:the-primer-architecture-s-depthwise-convolution-change-would-allow-induction-heads-to-form-without-requiring-k-composition

The Primer architecture's depthwise convolution change would allow induction heads to form without requiring K-composition

Architectural interpretation of how Primer's design change relates to the paper's mechanistic theory of induction heads

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits

(2021)

Neighborhood — ranked by edge-count

Concepts (1)

concept

Primer Architecture
supports
A transformer variant discovered via automated architecture search that applies a depthwise convolution over the last three positions when computing keys and queries, which makes induction heads expressible without K-composition
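The concept description above can be illustrated with a toy sketch. It assumes one-hot token embeddings, a hand-set convolution kernel that keeps only the previous position, and hard (argmax) attention instead of softmax. None of this is Primer's actual implementation, where the kernel weights are learned and applied per head. The function names `causal_depthwise_conv` and `induction_prediction` are hypothetical and introduced only for illustration.

```python
import numpy as np


def causal_depthwise_conv(x, kernel):
    """Filter each channel of x (seq_len, d) independently along the sequence.

    kernel has shape (width, d): kernel[0] weights the current position,
    kernel[1] the previous position, and so on. Positions before the start
    of the sequence count as zeros, so no future token is ever read.
    """
    x = np.asarray(x, dtype=float)
    seq_len = x.shape[0]
    out = np.zeros_like(x)
    for offset in range(kernel.shape[0]):
        if offset >= seq_len:
            break  # nothing left to shift in
        shifted = np.zeros_like(x)
        shifted[offset:] = x[:seq_len - offset]  # move rows back by `offset`
        out += shifted * kernel[offset]          # per-channel (depthwise) weight
    return out


def induction_prediction(tokens, vocab_size):
    """Return (attended position, predicted token) for the final token.

    Assumption: a single attention layer whose keys pass through the
    convolution, so the key at position j encodes token j-1 without any
    previous-token head (that is, without K-composition).
    """
    x = np.eye(vocab_size)[tokens]       # one-hot embeddings, (seq_len, vocab)
    kernel = np.zeros((3, vocab_size))   # width 3, as in the description above
    kernel[1] = 1.0                      # hand-set: keep only the previous position
    keys = causal_depthwise_conv(x, kernel)
    query = x[-1]                        # the current token looks for itself
    scores = keys @ query                # match against the shifted keys
    attended = int(np.argmax(scores))    # hard attention for clarity
    return attended, int(tokens[attended])
```

For the sequence A B C A encoded as `[0, 1, 2, 0]`, the final A matches the key at position 1 (which encodes the earlier A), so the head attends to B and copies it forward. The shift that a previous-token head would otherwise provide comes from the convolution alone.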

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Large models form many induction heads built from K-composition with a previous token head, making induction heads a central driver of in-context learning at all scales · claim · 0.777
Forward-looking claim connecting toy model findings to large-scale language models
Induction heads work by using K-composition with a previous token head to shift keys by one token, then matching the current destination token against shifted keys to predict what follows · claim · 0.774
The mechanistic explanation of how induction heads are implemented in two-layer models
Induction heads explain in-context learning in small models and only develop in models with at least two attention layers · claim · 0.744
Central empirical claim of the paper; induction heads are shown to be the mechanism for powerful in-context learning
All induction heads fall in an extreme corner of high OV eigenvalue positivity and high QK eigenvalue positivity, confirming the mechanistic theory · claim · 0.741
Quantitative verification that the copying and matching structure predicted by the mechanistic theory is present in all observed induction heads
Evolutionary transitions in individuality constitute a form of deep model induction · claim · 0.739
Links ETIs to the learning of hierarchical representations.
The mathematical framework and induction head concept will remain at least partially relevant for larger, more realistic models · hypothesis · 0.734
Central motivating hypothesis for the forthcoming paper on in-context learning and induction heads
In Opus 4.1, representation of the think word decays to baseline by the final layer, unlike Claude 3 models where it persists · finding · 0.722
Suggests that later models can keep the thought 'silent' rather than letting it influence output.
GPT-2 implements at least one induction head using pointer arithmetic on positional embeddings rather than K-composition · hypothesis · 0.720
Observation of an alternative induction head implementation algorithm in larger models with positional embeddings in the residual stream