finding
active
finding:in-a-4-over-100-features-primarily-respond-to-the-token-the-in-different-contexts
In A/4, over 100 features primarily respond to the token 'the' in different contexts
Demonstrates prevalence of token-in-context features and feature splitting of common tokens
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Authors argue that the prevalence of token-in-context features reflects genuine model computation rather than an artifact of dictionary learning
Concepts (1)
concept
- Token-in-Context Feature · supports · Feature that fires on a specific token only within a specific surrounding context (e.g., 'the' in physics vs 'the' in mathematics)
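
The concept definition above can be made concrete with a small sketch. It is not the source paper's procedure. The record layout, the toy activations, the function name `token_in_context_features`, and the 0.8 threshold are all hypothetical and chosen only for illustration. A feature is flagged here when most of its activation mass falls on one token and, within that token, on one context.

```python
from collections import defaultdict

# Hypothetical toy records: (feature_id, token, context_label, activation).
# These values are invented for illustration, not taken from the source paper.
RECORDS = [
    (0, "the", "physics", 0.9),
    (0, "the", "physics", 0.8),
    (0, "the", "math", 0.05),
    (1, "the", "math", 0.7),
    (1, "the", "math", 0.9),
    (1, "a", "math", 0.1),
    (2, "cat", "physics", 0.5),
    (2, "dog", "math", 0.5),
]


def token_in_context_features(records, token, threshold=0.8):
    """Return {feature_id: dominant_context} for features that put at least
    `threshold` of their activation mass on `token` and, within that token,
    at least `threshold` of the mass in a single context.

    The threshold is an assumed cutoff, not a value from the source.
    """
    total = defaultdict(float)       # activation mass per feature
    on_token = defaultdict(float)    # mass per feature on the target token
    by_context = defaultdict(lambda: defaultdict(float))

    for fid, tok, ctx, act in records:
        total[fid] += act
        if tok == token:
            on_token[fid] += act
            by_context[fid][ctx] += act

    result = {}
    for fid, mass in on_token.items():
        # Skip features that never fire, or that fire mostly on other tokens.
        if total[fid] <= 0 or mass / total[fid] < threshold:
            continue
        ctx, ctx_mass = max(by_context[fid].items(), key=lambda kv: kv[1])
        if ctx_mass / mass >= threshold:
            result[fid] = ctx
    return result


if __name__ == "__main__":
    print(token_in_context_features(RECORDS, "the"))
```

On this toy data, features 0 and 1 both respond mainly to 'the' but in different contexts, which is the pattern the finding describes as feature splitting of a common token.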
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Concrete example of feature splitting revealing unexpected model structure
- Shows that causal steering effects persist over long ranges for a substantial fraction of emotion probes
- Demonstrates activation specificity of the Arabic script sparse autoencoder feature
- Basic SAE performance metrics.
- Demonstrates mechanistic memorization via feature assemblies in superposition
- Robustness check on token choice for binary classification
- ATLAS hypothesis that a compact set of high-level functional tokens (Manip, Shape, Line, Arrow, Text) suffices for multi-domain visual reasoning.
- Demonstrates NLAs' ability to surface hypotheses that lead to discovery of root cause (malformed training data).