Anomalous Tokens

Vocabulary elements that are extremely rare or never used. They can distort logit weight analysis and are therefore excluded from feature analysis.

Neighborhood — ranked by edge-count

method

Logit Weight Analysis
associated_with
Computing each feature's linear effect on output token logits via path expansion through MLP output weights and unembedding matrix
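
As a rough illustration of how the two entries connect, the sketch below computes one feature's logit weights as the product of its MLP output direction and the unembedding matrix, then drops anomalous tokens before ranking. The function names, the toy matrices, and the count-based rule for flagging anomalous tokens (`min_count`) are assumptions made for illustration; the source does not say how anomalous tokens are identified.

```python
from typing import List, Sequence, Set


def logit_weights(w_out: Sequence[float],
                  w_unembed: Sequence[Sequence[float]]) -> List[float]:
    """Linear effect of one feature on every output token logit.

    w_out:     the feature's output direction, length d_model.
    w_unembed: unembedding matrix, shape (d_model, vocab_size).
    """
    vocab_size = len(w_unembed[0])
    return [
        sum(w_out[d] * w_unembed[d][t] for d in range(len(w_out)))
        for t in range(vocab_size)
    ]


def anomalous_token_ids(token_counts: Sequence[int],
                        min_count: int = 1) -> Set[int]:
    # Hypothetical rule: a token is anomalous if it was seen fewer than
    # min_count times in some reference corpus.
    return {t for t, c in enumerate(token_counts) if c < min_count}


def top_tokens(weights: Sequence[float], excluded: Set[int],
               k: int = 3) -> List[int]:
    # Rank token ids by logit weight, skipping excluded ids.
    kept = [(w, t) for t, w in enumerate(weights) if t not in excluded]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in kept[:k]]


# Toy setup (assumed values): d_model = 2, vocab_size = 4.
# Token 3 is never used and is given an exaggerated unembedding column
# to mimic the kind of distortion the definition warns about.
W_U = [
    [1.0, 0.0, 2.0, 9.0],
    [0.0, 1.0, 1.0, -9.0],
]
w_out = [1.0, 0.5]
counts = [120, 45, 300, 0]

weights = logit_weights(w_out, W_U)
excluded = anomalous_token_ids(counts)

print(top_tokens(weights, set(), k=2))     # ranking with the anomalous token
print(top_tokens(weights, excluded, k=2))  # ranking after exclusion
```

In this toy case the never-used token dominates the unfiltered ranking purely because of its unusual unembedding column, which is why filtering before interpreting the top tokens can be useful.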

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Token · concept · 0.751
Basic unit of LLM input/output: words, parts of words, punctuation marks, emojis
Token embeddings · concept · 0.724
Vector representations of individual tokens from genomic foundation models; the raw inputs to sequence pooling methods.
Functional Token · concept · 0.723
A discrete token in the vocabulary that represents a visual operation (e.g., <|Line|>, <|Shape|>, <|Text|>), generated via next-token prediction within autoregressive sequences.
Next Token Prediction · concept · 0.698
The training objective of LLMs: predicting the most likely next token given context; formally P(w_{n+1}|w_1...w_n)
Token-in-Context Feature · concept · 0.694
Feature that fires on a specific token only within a specific surrounding context (e.g., 'the' in physics vs 'the' in mathematics)
Anomaly detection mechanism · concept · 0.689
A possible circuit that triggers when activations deviate from expected values, hypothesized to underlie noticing injected thoughts.
All-token steering · method · 0.689
Baseline steering method that applies intervention at every token generation step, shown to degrade performance at high strengths
Previous Token Head · concept · 0.676
An attention head that primarily attends to the immediately preceding token; key building block for induction heads via K-composition
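
The listing above can be read as a filter over entity embeddings: keep entities whose cosine similarity to this one is at least 0.65 and that share no typed edge with it. A minimal sketch of such a filter follows. The function names, the toy vectors, and the representation of typed edges as a set of names are hypothetical and are not taken from the system that produced this page.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def untyped_neighbors(query: Sequence[float],
                      candidates: Dict[str, Sequence[float]],
                      typed: Set[str],
                      threshold: float = 0.65) -> List[Tuple[str, float]]:
    # Score every candidate that has no typed edge to the query entity,
    # keep those at or above the threshold, and sort by similarity.
    scored = [
        (name, cosine(query, vec))
        for name, vec in candidates.items()
        if name not in typed
    ]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed values, two dimensions for readability).
query_vec = [1.0, 0.0]
candidate_vecs = {
    "Token": [4.0, 3.0],                  # similar, no typed edge
    "Logit Weight Analysis": [9.0, 1.0],  # similar, but already linked
    "Unrelated Entity": [0.0, 1.0],       # orthogonal to the query
}
typed_edges = {"Logit Weight Analysis"}

neighbors = untyped_neighbors(query_vec, candidate_vecs, typed_edges)
for name, score in neighbors:
    print(f"{name}\t{score:.3f}")
```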