finding
active
Subcomponent L2.MLP.down:3382 (density 0.00%) predicts emoticon continuations after colon, semicolon, or equals
A specific discovered subcomponent that activates on punctuation such as ' :', ' ;', ' =', and ':-' and predicts the remainder of an emoticon or emoji.
Source paper
extracted_from
Neighborhood · ranked by edge-count
Claims (1)
claim
- Core design principle of VPD: each parameter subcomponent is constrained to be a simple rank-one matrix to enable isolated understanding and combination.
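The rank-one constraint in the claim above can be illustrated with a small NumPy sketch. This is not the paper's code: the matrix sizes, the random factors `U` and `V`, and the variable names are all hypothetical, chosen only to show why a rank-one subcomponent can be read in isolation (one input direction, one output direction) and why subcomponents combine by simple addition.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n_components = 6, 8, 4  # hypothetical sizes, not from the source

# Hypothetical factors: subcomponent c is the outer product U[:, c] V[:, c]^T,
# so it reads along one input direction and writes along one output direction.
U = rng.normal(size=(d_out, n_components))
V = rng.normal(size=(d_in, n_components))

subcomponents = [np.outer(U[:, c], V[:, c]) for c in range(n_components)]
W = sum(subcomponents)  # the full weight matrix is the sum of its subcomponents

# Each subcomponent has rank exactly one.
ranks = [int(np.linalg.matrix_rank(S)) for S in subcomponents]

x = rng.normal(size=d_in)
activations = V.T @ x        # one scalar activation per subcomponent
y = W @ x                    # full layer output
y_from_parts = U @ activations  # same output, rebuilt from the scalars

# Removing subcomponent 0 changes the output only along U[:, 0].
W_ablated = W - subcomponents[0]
delta = y - W_ablated @ x
```

Because each subcomponent contributes `activations[c] * U[:, c]` to the output, overwriting or removing one of them has a predictable, localized effect, which is the property the claim attributes to the rank-one design.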
Hypotheses (1)
hypothesis
- Core empirical hypothesis of the paper, supported by successful VPD decomposition yielding ~10,000 interpretable subcomponents across 24 weight matrices.
Communities (2)
community
- Few-shot anchoring & latent structure · members_of · How minimal examples disambiguate and recruit latent arithmetic/reasoning interpretations in LLMs
- Direct modification of model subcomponents (MLPs, embeddings, unembedding vectors) to predictably alter outputs without retraining, using rank-one constraints.
Questions (1)
question
- First question posed after applying VPD, investigating whether the subcomponents make sense.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- SAE features are not simply mirroring individual neurons.
- The functional role of a specific VPD subcomponent in predicting emoticon/emoji continuations after punctuation.
- The part of the emoticon subcomponent responsible for recognizing the 'eyes' of emoticons like ';', ':' or '=', which was edited in the demo.
- Claim about the sparsity and sufficiency of the identified neuron set
- Direct parameter subcomponent overwrite produces a clean behavioral change without training.
- Interpretation of cross-base transfer asymmetry.
- Demonstrates that activation similarity can diverge from logit weight similarity due to interference
- Shows model persona position is primarily determined by the most recent user message, not prior drift