Virtual Attention Head

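The composition of two attention heads via V-composition, forming a new effective head with its own attention pattern A^h2 * A^h1 and OV matrix W_OV^h2 * W_OV^h1, where h1 is the earlier-layer head, h2 is the later-layer head, and * denotes matrix multiplication.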
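
The definition above can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: it assumes a row-per-token residual stream `X`, random causal attention patterns, and random full-rank OV matrices (in a real model each W_OV is low-rank). The helper names `causal_attention_pattern` and `apply_head` are hypothetical. It compares running head h1 and then head h2 along the value path against running a single virtual head built from the two products.

```python
import numpy as np

def causal_attention_pattern(rng, n_tokens):
    """Random row-stochastic, lower-triangular pattern (illustrative stand-in)."""
    scores = rng.normal(size=(n_tokens, n_tokens))
    mask = np.tril(np.ones((n_tokens, n_tokens), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def apply_head(A, W_OV, X):
    """Head output with tokens as rows: A mixes positions, W_OV maps features."""
    return A @ X @ W_OV.T

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8  # assumed toy sizes
X = rng.normal(size=(n_tokens, d_model))

A_h1 = causal_attention_pattern(rng, n_tokens)
A_h2 = causal_attention_pattern(rng, n_tokens)
W_OV_h1 = rng.normal(size=(d_model, d_model))
W_OV_h2 = rng.normal(size=(d_model, d_model))

# Head h2 reads head h1's output through its value path.
sequential = apply_head(A_h2, W_OV_h2, apply_head(A_h1, W_OV_h1, X))

# The virtual head has its own attention pattern and OV matrix.
A_virtual = A_h2 @ A_h1
W_OV_virtual = W_OV_h2 @ W_OV_h1
virtual = apply_head(A_virtual, W_OV_virtual, X)

print(np.allclose(sequential, virtual))
```

The product of two causal, row-stochastic patterns is again causal and row-stochastic, which is why the virtual head can be read as a head in its own right.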

Neighborhood — ranked by edge-count

Papers (1)

paper

A Mathematical Framework for Transformer Circuits
introduces

Claims (1)

claim

Attention is a generalization of convolution; all convolutions can be expressed as tensor products of fixed relative position attention patterns and weight matrices
supports
Mathematical equivalence showing the relationship between attention mechanisms and convolutional operations
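
The claim above can be illustrated with a small NumPy sketch. It assumes a causal 1D convolution with zero padding and hypothetical toy dimensions, and writes the convolution as a sum of terms, each pairing a fixed relative-position attention pattern with one weight matrix. Note that the fixed patterns have all-zero rows at the sequence start (the zero padding), so they are not softmax outputs.

```python
import numpy as np

def conv1d_direct(X, W):
    """Causal 1D convolution: Y[t] = sum_j W[j] @ X[t - j], zero-padded."""
    n_tokens = X.shape[0]
    kernel_size, d_out, _ = W.shape
    Y = np.zeros((n_tokens, d_out))
    for t in range(n_tokens):
        for j in range(kernel_size):
            if t - j >= 0:
                Y[t] += W[j] @ X[t - j]
    return Y

def conv1d_as_attention(X, W):
    """Same convolution as a sum of (fixed pattern, weight matrix) terms."""
    n_tokens = X.shape[0]
    Y = np.zeros((n_tokens, W.shape[1]))
    for j, W_j in enumerate(W):
        # Fixed relative-position pattern: position t attends only to t - j.
        A_j = np.eye(n_tokens, k=-j)
        Y += A_j @ X @ W_j.T
    return Y

rng = np.random.default_rng(0)
n_tokens, d_in, d_out, kernel_size = 6, 4, 3, 3  # assumed toy sizes
X = rng.normal(size=(n_tokens, d_in))
W = rng.normal(size=(kernel_size, d_out, d_in))  # W[j] is the weight for offset j

print(np.allclose(conv1d_direct(X, W), conv1d_as_attention(X, W)))
```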

Concepts (1)

concept

V-Composition
implements
A form of attention head composition where W_V reads from a subspace affected by a previous head, creating virtual attention heads
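
A rough way to see the subspace condition in this description is sketched below. Everything here is an assumption for illustration: the toy dimensions, the hypothetical helper `composition_score`, and the choice of a normalized Frobenius-norm ratio as the measure of how strongly two OV matrices compose. Head h1 writes into one subspace of the residual stream, and two candidate versions of head h2's W_V read either from that subspace or from an orthogonal one.

```python
import numpy as np

def composition_score(W_OV_h2, W_OV_h1):
    """Frobenius norm of the product, normalized by the factors' norms."""
    product = W_OV_h2 @ W_OV_h1
    return np.linalg.norm(product) / (
        np.linalg.norm(W_OV_h2) * np.linalg.norm(W_OV_h1)
    )

rng = np.random.default_rng(0)
d_model, d_head = 8, 2  # assumed toy sizes

# Orthonormal basis of the residual stream, split into two disjoint subspaces.
basis, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
written = basis[:, :d_head]                # subspace head h1 writes into
untouched = basis[:, d_head:2 * d_head]    # orthogonal subspace

# Head h1: W_OV = W_O @ W_V, with W_O's columns inside the written subspace.
W_V_h1 = rng.normal(size=(d_head, d_model))
W_O_h1 = written @ rng.normal(size=(d_head, d_head))
W_OV_h1 = W_O_h1 @ W_V_h1

# Head h2: same W_O, two alternative W_V matrices reading different subspaces.
W_O_h2 = rng.normal(size=(d_model, d_head))
W_V_reads_written = rng.normal(size=(d_head, d_head)) @ written.T
W_V_reads_untouched = rng.normal(size=(d_head, d_head)) @ untouched.T

score_composing = composition_score(W_O_h2 @ W_V_reads_written, W_OV_h1)
score_not_composing = composition_score(W_O_h2 @ W_V_reads_untouched, W_OV_h1)

print(score_composing, score_not_composing)
```

When W_V reads only from the orthogonal subspace, the product W_OV^h2 * W_OV^h1 vanishes up to floating-point error, so no virtual head is formed.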

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Attention heads · concept · 0.828
Transformer attention heads that could be recruited to extract different kinds of information (text vs. thoughts).
attention head localization analysis · method · 0.770
Analysis measuring whether each attention head's maximum attention increase points to the correct injected sentence
Talking Heads Attention · concept · 0.747
A transformer variant where OV and QK matrices of different attention heads can share components, enabling shared copying mechanisms
Self-attention · concept · 0.728
A form of key-query attention within a single input sequence; core to Transformers.
attention computation · concept · 0.723
Process that uses Q and K to compute a heat map of attention weights over the keys, then takes the corresponding weighted sum of V.
Virtual attention heads (V-composition) may be much more important in larger and more complex transformers than in two-layer toy models · hypothesis · 0.722
Forward-looking speculation based on the theoretical elegance and combinatorial growth of virtual head count with depth
Attention algorithms are usually distributed across attention heads · claim · 0.721
Claim supported by VPD's recovery of cross-head attention subcomponents, noted in footnote.
Attention heads can be understood as independent operations each adding their output to the residual stream, equivalent to the concatenate-and-multiply formulation · claim · 0.709
Mathematical equivalence enabling independent analysis of each attention head
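
The last claim in the list above, that summing independent per-head outputs matches the concatenate-and-multiply formulation, can be sketched in NumPy. The dimensions, the random attention patterns, and all variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads, d_head = 5, 8, 2, 4  # assumed toy sizes
X = rng.normal(size=(n_tokens, d_model))

# One causal, row-stochastic attention pattern per head (random stand-ins).
A = np.tril(rng.random(size=(n_heads, n_tokens, n_tokens)))
A = A / A.sum(axis=-1, keepdims=True)

W_V = rng.normal(size=(n_heads, d_head, d_model))   # per-head value matrices
W_O = rng.normal(size=(d_model, n_heads * d_head))  # shared output matrix

# Per-head result vectors, each of shape (n_tokens, d_head).
results = [A[h] @ X @ W_V[h].T for h in range(n_heads)]

# Formulation 1: concatenate the head results, then multiply by W_O.
concat_out = np.concatenate(results, axis=-1) @ W_O.T

# Formulation 2: each head uses its own block of W_O and adds its output.
sum_out = np.zeros((n_tokens, d_model))
for h in range(n_heads):
    W_O_h = W_O[:, h * d_head:(h + 1) * d_head]
    sum_out += results[h] @ W_O_h.T

print(np.allclose(concat_out, sum_out))
```

Splitting W_O into per-head blocks is what lets each head's W_OV = W_O_h @ W_V[h] be analyzed on its own.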