question

active

question:when-and-how-can-mlp-neurons-in-transformers-be-individually-interpreted-and-what-progress-is-needed-to-extend-mechanistic-interpretability-to-them

When and how can MLP neurons in transformers be individually interpreted, and what progress is needed to extend mechanistic interpretability to them?

Major open problem identified in the paper; MLP layers make up two-thirds of a standard transformer's parameters

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits

(2021)

Neighborhood — ranked by edge-count

Papers (1)

paper

A Mathematical Framework for Transformer Circuits
associated_with

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

MLP neurons · concept · 0.783
The sparse set of 28 neurons at layer 18 identified as responsible for Fourier feature computation across all cyclic tasks
MLP layers are much harder to get traction on than attention layers; understanding them requires individually interpretable neurons which are rarely found · claim · 0.765
Key limitation of the paper's approach; MLP layers make up 2/3 of standard transformer parameters
Neurons can correspond to interpretable functional roles but interpretations in terms of individual neurons are unlikely to be the most parsimonious · claim · 0.763
Claim from footnote 3, acknowledging neuron-level interpretability while arguing subcomponents are better.
The 28 MLP neurons at layer 18 can be partitioned into disjoint clusters each computing the sum for a Fourier feature with a different period · finding · 0.761
Structural finding showing modular organization within the sparse neuron set
Some MLP neurons and attention heads perform memory management by reading residual stream information and writing its negative to delete it · claim · 0.761
Hypothesis based on observed negative cosine similarity between input and output weights of some neurons
So at any point in the network, the transformer not only receives information from its past... but also has causal influence over its future processing. So, saying that LLMs cannot introspect... is incorrect. · quote · 0.758
Core summary of Janus' position on autoregressive recurrence enabling introspection.
Transformers develop self-models through in-context learning, not just training data; even old base models without LLM-related text can bootstrap self-referential reasoning at runtime. · claim · 0.756
Antra's foundational claim about how introspection arises computationally rather than from memorised text.
Multi-layer Perceptron (MLP) · method · 0.754
Feed-forward neural network with hidden layers, capable of representing non-linearly separable functions.