claim

active

claim:some-mlp-neurons-and-attention-heads-perform-memory-management-by-reading-residual-stream-information-and-writing-its-negative-to-delete-it

Some MLP neurons and attention heads perform memory management by reading residual stream information and writing its negative to delete it

Hypothesis based on the observed negative cosine similarity between the input and output weights of some neurons.
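The signal behind this hypothesis can be illustrated with a minimal sketch. It assumes an MLP layer whose input weights `w_in` have shape `(d_model, d_mlp)` and whose output weights `w_out` have shape `(d_mlp, d_model)`. The names, shapes, random weights, and the -0.9 threshold are all illustrative assumptions, not taken from the source paper. Neuron 0 is deliberately planted so that it writes the exact negative of the direction it reads, which is the pattern the claim describes.

```python
import numpy as np


def neuron_io_cosine(w_in: np.ndarray, w_out: np.ndarray) -> np.ndarray:
    """Cosine similarity between each neuron's read and write directions.

    w_in:  (d_model, d_mlp), column i is the direction neuron i reads.
    w_out: (d_mlp, d_model), row i is the direction neuron i writes.
    """
    read = w_in.T   # (d_mlp, d_model)
    write = w_out   # (d_mlp, d_model)
    dots = np.sum(read * write, axis=1)
    norms = np.linalg.norm(read, axis=1) * np.linalg.norm(write, axis=1)
    return dots / norms


# Synthetic weights stand in for a real model (assumption for illustration).
rng = np.random.default_rng(0)
d_model, d_mlp = 64, 256
w_in = rng.normal(size=(d_model, d_mlp))
w_out = rng.normal(size=(d_mlp, d_model))

# Plant one toy "deleting" neuron: it writes the negative of what it reads.
w_out[0] = -w_in[:, 0]

cos = neuron_io_cosine(w_in, w_out)

# Strongly negative similarity marks a candidate memory-management neuron.
# The -0.9 cutoff is an arbitrary choice for this toy example.
candidates = np.flatnonzero(cos < -0.9)
print(candidates)
```

On real weights, a strongly negative value would only flag a candidate. The source frames the deletion role as a hypothesis, so a low cosine similarity alone does not confirm it.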

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits

(2021)

Neighborhood — ranked by edge-count

Papers (1)

paper

A Mathematical Framework for Transformer Circuits
introduces

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

What are the specific attention heads or MLP neurons (circuits) responsible for self-reflection in LLMs? · question · 0.817
Future research question about pinpointing fine-grained mechanistic components of reflection.
Memory Management Neurons/Heads · concept · 0.802
MLP neurons and attention heads hypothesized to delete information from the residual stream by writing the negative of what they read
Some attention heads partially specialize in copying for words that split into two tokens without a space prefix, attending from fragmented token to complete token · finding · 0.774
Interesting special case of copying behavior related to tokenization artifacts; primitive precursor to induction heads
MLP layers are much harder to get traction on than attention layers; understanding them requires individually interpretable neurons which are rarely found · claim · 0.769
Key limitation of the paper's approach; MLP layers make up 2/3 of standard transformer parameters
A sparse set of 28 MLP neurons at layer 18 (~0.2% of MLP) are reused across all cyclic tasks · finding · 0.763
Quantitative finding identifying the specific neurons responsible for generic addition
Attention heads can be understood as independent operations each adding their output to the residual stream, equivalent to the concatenate-and-multiply formulation · claim · 0.762
Mathematical equivalence enabling independent analysis of each attention head
Most attention heads in one-layer models dedicate an enormous fraction of their capacity to copying behavior · claim · 0.762
Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
When and how can MLP neurons in transformers be individually interpreted, and what progress is needed to extend mechanistic interpretability to them? · question · 0.761
Major open problem identified in the paper; MLP layers constitute 2/3 of transformer parameters