Talking Heads Attention

A transformer variant where OV and QK matrices of different attention heads can share components, enabling shared copying mechanisms

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Attention heads · concept · 0.838
Transformer attention heads that could be recruited to extract different kinds of information (text vs. thoughts).
Virtual Attention Head · concept · 0.747
The composition of two attention heads via V-composition, forming a new entity with its own attention pattern A^h2 * A^h1 and OV matrix W_OV^h2 * W_OV^h1.
attention head localization analysis · method · 0.745
Analysis measuring whether each attention head's maximum attention increase points to the correct injected sentence.
Self-attention · concept · 0.719
A form of key-query attention within a single input sequence; core to Transformers.
Induction Heads · concept · 0.709
Mechanistic circuits in transformers documented by Olsson et al. (2022), cited as evidence for the pattern-repository assumption.
Some attention heads partially specialize in copying for words that split into two tokens without a space prefix, attending from the fragmented token to the complete token · finding · 0.686
Interesting special case of copying behavior related to tokenization artifacts; a primitive precursor to induction heads.
attention computation · concept · 0.678
Process using Q, K, V to compute a heat map over K and a weighted sum of V.
paying attention to the wholeness · concept · 0.675
The act of seeing and feeling the entire field of centers at a place, which Alexander equates with love of life.
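
Two of the neighbor descriptions above are formulaic enough to sketch in code: "attention computation" (a heat map over K, then a weighted sum of V) and "Virtual Attention Head" (pattern A^h2 * A^h1 with OV matrix W_OV^h2 * W_OV^h1). The following is a minimal illustration in plain Python, not taken from any of the listed sources. The helper names (matmul, attention, compose_virtual_head), the 1/sqrt(d_k) scaling, the omission of causal masking, and the toy matrices are all assumptions made for this example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Return (pattern, output) for one head.

    pattern[i][j] is how strongly query position i attends to key position j
    (the "heat map over K"); output is the pattern-weighted sum of the rows of v.
    Assumption: scaled dot-product scores, no causal mask.
    """
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    pattern = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return pattern, matmul(pattern, v)

def compose_virtual_head(a_h1, a_h2, w_ov_h1, w_ov_h2):
    """V-composition of head h1 (earlier) with head h2 (later).

    Returns the virtual head's attention pattern A^h2 A^h1 and its
    OV matrix W_OV^h2 W_OV^h1 (column-vector convention: h1 is applied first).
    """
    return matmul(a_h2, a_h1), matmul(w_ov_h2, w_ov_h1)

# Toy inputs (hypothetical): two positions, two dimensions.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
pattern, out = attention(q, k, v)

# Compose the head with itself, using an identity OV matrix for the later head.
identity = [[1.0, 0.0], [0.0, 1.0]]
w_ov_h1 = [[0.5, 0.0], [0.0, 2.0]]
virtual_pattern, virtual_ov = compose_virtual_head(pattern, pattern, w_ov_h1, identity)
```

Each row of pattern sums to 1, and so does each row of virtual_pattern, because a product of row-stochastic matrices is itself row-stochastic. This is what lets the composed pair be read as a single head with its own attention pattern.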