Single-Token Features

Features that fire on every instance of a single token. They appear in small dictionaries as collapsed versions of many token-in-context features.
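
As a rough illustration of the definition above, the sketch below checks whether one feature behaves like a single-token feature on a sample of text. It assumes you already have per-token activations for that feature. The function names, the activation threshold, and the 0.99 cutoffs are illustrative assumptions, not part of any SAE library.

```python
from collections import defaultdict


def token_coverage(tokens, activations, threshold=0.0):
    """Return (token, precision, recall) for the token the feature fires on most.

    precision: share of the feature's firings that land on that token.
    recall:    share of that token's occurrences on which the feature fires.
    Returns None if the feature never fires in the sample.
    """
    fired = defaultdict(int)  # firings per token
    seen = defaultdict(int)   # occurrences per token
    total_fired = 0
    for tok, act in zip(tokens, activations):
        seen[tok] += 1
        if act > threshold:
            fired[tok] += 1
            total_fired += 1
    if total_fired == 0:
        return None
    best = max(fired, key=fired.get)
    return best, fired[best] / total_fired, fired[best] / seen[best]


def is_single_token_feature(tokens, activations, min_precision=0.99, min_recall=0.99):
    """Hypothetical test: fires (almost) only on one token, on (almost) every instance."""
    result = token_coverage(tokens, activations)
    if result is None:
        return False
    _, precision, recall = result
    return precision >= min_precision and recall >= min_recall
```

Under this sketch, a token-in-context feature would show high precision but low recall: it fires only on one token, yet only on the subset of occurrences that match its context.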

Neighborhood — ranked by edge-count

Feature splitting · concept · associated_with
Phenomenon where a feature in a small SAE splits into multiple finer features in a larger SAE.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Token-in-Context Feature · concept · 0.750
Feature that fires on a specific token only within a specific surrounding context (e.g., 'the' in physics vs 'the' in mathematics)
Token · concept · 0.748
Basic unit of LLM input/output: words, parts of words, punctuation marks, emojis
Pure Feature · concept · 0.726
A feature that responds to only a single latent variable, contrasted with polysemantic features
Functional Token · concept · 0.725
A discrete token in the vocabulary that represents a visual operation (e.g., <|Line|>, <|Shape|>, <|Text|>), generated via next-token prediction within autoregressive sequences.
Feature Universality · concept · 0.720
Property of features that form consistently across different models trained on the same or similar data, suggesting features are real representational units
What features activate on tokens we'd expect to signify Claude's self-identity? · question · 0.712
Open question from the discussion on future research directions.
Token embeddings · concept · 0.703
Vector representations of individual tokens from genomic foundation models; the raw inputs to sequence pooling methods.
Summarization Token Behavior · concept · 0.700
Behavior where information about full clauses is encoded over clause-ending punctuation tokens in LLMs
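
The listing above (cosine ≥ 0.65, no typed edge) could be produced by a filter along the following lines. This is a minimal sketch under stated assumptions: entity embeddings are plain lists of floats, typed edges are stored as undirected pairs of entity names, and `untyped_neighbors` is a hypothetical helper, not the tool that generated this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, min_cosine=0.65):
    """List entities similar to `target` that share no typed edge with it.

    embeddings:  dict mapping entity name -> vector (assumed layout)
    typed_edges: iterable of (name_a, name_b) pairs, treated as undirected
    Returns (name, similarity) pairs sorted by descending similarity.
    """
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)
    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already related
        score = cosine(embeddings[target], vec)
        if score >= min_cosine:
            candidates.append((name, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high cosine score suggests a possible missing edge or a duplicate, and each still needs manual review.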