finding

active

finding:in-a-4-over-100-features-primarily-respond-to-the-token-the-in-different-contexts

In A/4, over 100 features primarily respond to the token 'the' in different contexts

Demonstrates prevalence of token-in-context features and feature splitting of common tokens

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

The transformer likely uses a local code for token-in-context features rather than purely compositional representations, because local codes enable sharper predictions
associated_with
Authors argue the prevalence of token-in-context features reflects genuine model computation rather than dictionary learning artifact

Concepts (1)

concept

Token-in-Context Feature
supports
Feature that fires on a specific token only within a specific surrounding context (e.g., 'the' in physics vs 'the' in mathematics)
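The definition above can be illustrated with a minimal toy sketch: a feature that is nonzero only when the token matches and the named context is present. This is not the authors' method or a real dictionary-learning feature. The function names, the `context_scores` dictionary, and the score values are all hypothetical, chosen only to make the "token and context" conjunction concrete.

```python
def token_in_context_activation(token, context_scores, target_token, target_context):
    """Toy feature: fires only if the token matches AND the target context is present.

    context_scores is a hypothetical dict mapping a context label to a
    strength in [0, 1]; a real model has no such explicit labels.
    """
    if token != target_token:
        return 0.0
    return max(0.0, context_scores.get(target_context, 0.0))


def physics_the(token, context_scores):
    """Hypothetical feature for 'the' in a physics context."""
    return token_in_context_activation(token, context_scores, "the", "physics")


in_physics = physics_the("the", {"physics": 0.9})      # right token, right context
in_maths = physics_the("the", {"mathematics": 0.9})    # right token, wrong context
other_token = physics_the("energy", {"physics": 0.9})  # wrong token, right context
```

Only the first call yields a nonzero activation; the same token in a mathematics context, or a different token in a physics context, leaves the feature silent.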

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Single base64 feature A/0/45 splits into three distinct features in A/1: letter-specific, digit-specific, and ASCII-encoding-specific · finding · 0.748
Concrete example of feature splitting revealing unexpected model structure
48 of 171 emotion probes individually significant at token 100 post-steering · finding · 0.746
Shows that causal steering effects persist over long ranges for a substantial fraction of emotion probes
Arabic script feature A/1/3450 fires on 81% Arabic-script tokens when active, with 98% specificity at high activation levels · finding · 0.740
Demonstrates activation specificity of the Arabic script sparse autoencoder feature
For all three SAEs (1M, 4M, 34M), average active features per token <300, and reconstruction variance explained ≥65% · finding · 0.740
Basic SAE performance metrics.
In A/4, features functionally memorize 'MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE' via FSA-like feature chain · finding · 0.735
Demonstrates mechanistic memorization via feature assemblies in superposition
Alternative tokenizations Yes/No vs yes/no vs true/false had no significant effect on steering outcomes or ASR · finding · 0.729
Robustness check on token choice for binary classification
Five functional tokens can generalize across 40+ diverse visual reasoning tasks · hypothesis · 0.728
ATLAS hypothesis that a compact set of high-level functional tokens (Manip, Shape, Line, Arrow, Text) suffices for multi-domain visual reasoning.
Opus 4.6 spontaneously responded in Russian to an English prompt; NLA explanations revealed the model was fixated on the hypothesis that the user was a non-native English speaker. · finding · 0.727
Demonstrates NLAs' ability to surface hypotheses that lead to discovery of root cause (malformed training data).
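
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration under stated assumptions, not the site's actual implementation: the entity names, the three-dimensional embeddings, and the `typed_edges` set are hypothetical toy data, and only the 0.65 threshold comes from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs at or above threshold that share no typed edge
    with the target, ordered by descending similarity."""
    hits = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already linked by a typed edge.
        if name == target or frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy data: names and vectors are illustrative only.
embeddings = {
    "the-features": [1.0, 0.0, 0.0],
    "base64-split": [0.8, 0.6, 0.0],      # similar, no typed edge
    "token-in-context": [0.9, 0.1, 0.0],  # similar, but already linked
    "unrelated": [0.0, 0.0, 1.0],         # below the threshold
}
typed_edges = {frozenset(("the-features", "token-in-context"))}

neighbors = similar_untyped("the-features", embeddings, typed_edges)
```

Entities that already have a typed relation (such as the concept and claim listed earlier) are excluded, so the result contains only candidates for new edges or unrecognized duplicates.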