claim

active

claim:mlp-layers-are-much-harder-to-get-traction-on-than-attention-layers-understanding-them-requires-individually-interpretable-neurons-which-are-rarely-found

MLP layers are much harder to get traction on than attention layers; understanding them requires individually interpretable neurons which are rarely found

A key limitation of the paper's approach: MLP layers make up 2/3 of a standard transformer's parameters, so most of the model falls outside the analysis.
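The 2/3 figure can be reproduced with a rough parameter count. The sketch below is illustrative only. It assumes an MLP hidden width of 4 × d_model (a common convention, not stated on this page) and ignores biases, layer norms, and embeddings. The function name and the example width of 768 are hypothetical.

```python
def layer_param_counts(d_model: int, mlp_ratio: int = 4) -> dict:
    """Rough per-layer weight counts (assumption: no biases, norms, or embeddings)."""
    # Attention: W_Q, W_K, W_V, W_O, each d_model x d_model when summed over heads.
    attention = 4 * d_model * d_model
    # MLP: one up-projection and one down-projection with hidden width mlp_ratio * d_model.
    mlp = 2 * d_model * (mlp_ratio * d_model)
    return {
        "attention": attention,
        "mlp": mlp,
        "mlp_fraction": mlp / (attention + mlp),
    }


counts = layer_param_counts(768)  # 768 is an arbitrary example width
print(counts["mlp_fraction"])
```

Under these assumptions the fraction is 8/12 regardless of d_model, which matches the 2/3 quoted above.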

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits

(2021)

Neighborhood — ranked by edge-count

Papers (1)

paper

A Mathematical Framework for Transformer Circuits
introduces

Concepts (2)

concept

Polysemanticity
supports
Neurons that respond to multiple unrelated concepts, limiting interpretability.
GeLU Activation Function
supports
The nonlinear activation function used in MLP layers; prevents the linearization approach used for attention layers from extending to MLP layers
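A small numerical check shows why GeLU blocks a purely linear treatment. A linear map f satisfies f(a) + f(b) = f(a + b), and GeLU does not. This sketch uses the exact erf-based definition of GeLU. Whether a given model uses this form or a tanh approximation is an assumption here.

```python
import math


def gelu(x: float) -> float:
    """Exact GeLU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


a, b = 1.0, -1.0
lhs = gelu(a) + gelu(b)  # what additivity would require to equal gelu(a + b)
rhs = gelu(a + b)        # gelu(0.0) is exactly 0.0
print(lhs, rhs)          # the two values differ, so GeLU is not linear
```

Because the two sides disagree, the MLP's weight matrices cannot be multiplied through the activation the way attention-layer matrices are combined in the paper's linearized analysis.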

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Multi-layer Perceptron (MLP) · method · 0.789
Feed-forward neural network with hidden layers, capable of representing non-linearly separable functions.
A sparse set of 28 MLP neurons at layer 18 (~0.2% of MLP) are reused across all cyclic tasks · finding · 0.782
Quantitative finding identifying the specific neurons responsible for generic addition
To what extent do interpretable features represent the 'full story' of the MLP layer? · question · 0.773
Question about completeness of feature-based model explanation
The 28 MLP neurons at layer 18 can be partitioned into disjoint clusters each computing the sum for a Fourier feature with a different period · finding · 0.773
Structural finding showing modular organization within the sparse neuron set
What are the specific attention heads or MLP neurons (circuits) responsible for self-reflection in LLMs? · question · 0.770
Future research question about pinpointing fine-grained mechanistic components of reflection.
Some MLP neurons and attention heads perform memory management by reading residual stream information and writing its negative to delete it · claim · 0.769
Hypothesis based on observed negative cosine similarity between input and output weights of some neurons
When and how can MLP neurons in transformers be individually interpreted, and what progress is needed to extend mechanistic interpretability to them? · question · 0.765
Major open problem identified in the paper; MLP layers constitute 2/3 of transformer parameters
Polysemantic neurons are a major challenge for the circuits agenda, because N meanings in one neuron times M in another creates N×M effective connections that cannot be considered individually. · claim · 0.761
Precise characterization of why polysemanticity poses a combinatorial obstacle to circuit analysis
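
The N×M count in the last entry can be made concrete by enumerating meaning pairs. The meanings listed below are hypothetical placeholders, not taken from any paper.

```python
from itertools import product

# Hypothetical meanings for two polysemantic neurons (illustration only).
neuron_a_meanings = ["cat faces", "car fronts", "cat legs"]  # N = 3
neuron_b_meanings = ["dog snouts", "fur texture"]            # M = 2

# A single weight between the two neurons stands in for every pairing of meanings.
effective_connections = list(product(neuron_a_meanings, neuron_b_meanings))
print(len(effective_connections))  # 6
```

One scalar weight therefore has to be read as N×M entangled relationships, which is why such a connection cannot be interpreted on its own.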