method
active
method:causal-attention-mask
Causal Attention Mask
A modification to the transformer that restricts keys and values to previous time-steps only, mimicking how an agent accumulates experiences over time.
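As a rough illustration of the masking idea described above, here is a minimal single-head sketch in pure Python. It is not the TEM-t implementation. The function names `causal_mask` and `causal_attention` are hypothetical, and the sketch assumes the common decoder convention in which time-step t may attend to itself as well as to earlier steps; for a strictly-previous variant, change `s <= t` to `s < t` and handle the first step, which would then have nothing to attend to.

```python
import math


def causal_mask(n):
    """n x n boolean matrix; entry [t][s] is True when query t may attend to key s.

    Assumption: the current step is included (s <= t).
    """
    return [[s <= t for s in range(n)] for t in range(n)]


def causal_attention(queries, keys, values):
    """Scaled dot-product attention restricted by a causal mask.

    queries, keys, values: lists of equal length n, each item a list of floats.
    Returns (outputs, weights), where weights[t][s] is the attention that
    time-step t places on time-step s.
    """
    n = len(queries)
    d = len(queries[0])
    mask = causal_mask(n)
    outputs, all_weights = [], []
    for t in range(n):
        # Masked (future) positions get a score of -inf, so their softmax weight is 0.
        scores = [
            sum(q * k for q, k in zip(queries[t], keys[s])) / math.sqrt(d)
            if mask[t][s]
            else float("-inf")
            for s in range(n)
        ]
        top = max(scores)  # finite, because position t itself is always allowed
        exps = [math.exp(x - top) for x in scores]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the weighted sum of the values the step is allowed to see.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    _, w = causal_attention(x, x, x)
    for row in w:
        print([round(p, 3) for p in row])
```

Each row of the weight matrix is zero to the right of the diagonal, so the output at a given time-step depends only on keys and values already seen, which is the sense in which the mask mirrors experiences accumulating over time.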
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- TEM-Transformer (TEM-t) · implements · The transformer version directly analogous to TEM, introduced in this paper, offering dramatic performance improvements.
Methods (1)
method
- causally-masked attention · related_to · Attention mechanism with a causal mask limiting each token's view to previous tokens; used in decoder-only transformers.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Attention restricted to previous tokens only, as in decoder-only models; leads to AR(ω)-like behaviour and no ordered phase
- Application to transformer language models
- Core operation in transformers, computing weighted combinations of previous elements
- A framework the paper uses alongside feature geometry to deepen mechanistic understanding of LMs
- Whether an internal direction causally controls a target behavior, verified by intervention success
- The use of interventions (rather than correlations) to establish a causal link between representation geometry and behavioral geometry.
- A measure of whether a subcomponent is necessary to reproduce model behavior on a specific prompt, predicted by the causal importance network.
- A form of key-query attention within a single input sequence; core to Transformers.