finding

active

finding:subcomponent-l2-mlp-down-3382-density-0-00-predicts-emoticon-continuations-after-colon-semicolon-or-equals

Subcomponent L2.MLP.down:3382 (density 0.00%) predicts emoticon continuations after colon, semicolon, or equals

A specific discovered subcomponent that activates on punctuation such as ' :', ' ;', ' =', and ':-' and predicts the remainder of an emoticon or emoji.

Source paper

extracted_from

cimcWhitepaper

Neighborhood — ranked by edge-count

Claims (1)

claim

Rank-one matrix decomposition constraint enforcing mechanistic simplicity
supports
Core design principle of VPD: each parameter subcomponent is constrained to be a simple rank-one matrix to enable isolated understanding and combination.
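As a minimal sketch of what the rank-one constraint implies (not the paper's implementation): a rank-one matrix can be written as an outer product of a write vector `u` and a read vector `v`, so applying it to an input `x` reduces to one scalar activation `v·x` that scales `u`. The names `u`, `v`, `x`, and all helper functions below are hypothetical, and the numbers are toy values chosen only for illustration.

```python
# Toy illustration: a rank-one matrix u v^T reads along one direction (v)
# and writes along one direction (u). All values are made up.

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def outer(u, v):
    """Dense rank-one matrix u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def matvec(m, x):
    """Ordinary matrix-vector product."""
    return [dot(row, x) for row in m]

def rank_one_apply(u, v, x):
    """Apply u v^T to x without building the matrix: (v . x) * u."""
    activation = dot(v, x)  # single scalar "how strongly does this fire"
    return [activation * ui for ui in u]

u = [1.0, 0.0, -2.0]   # hypothetical write direction (output space)
v = [0.5, 2.0]         # hypothetical read direction (input space)
x = [4.0, 1.0]         # hypothetical input

full = matvec(outer(u, v), x)
fast = rank_one_apply(u, v, x)
```

Because the whole effect of such a matrix is one scalar times one fixed output direction, each subcomponent can be inspected by asking two questions: which inputs make the scalar large, and what the output direction promotes.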

Hypotheses (1)

hypothesis

Language models contain interpretable computational structure encoded in their parameter weights, not irreducibly impenetrable complexity
answered_by
Core empirical hypothesis of the paper, supported by successful VPD decomposition yielding ~10,000 interpretable subcomponents across 24 weight matrices.

Communities (2)

community

Few-shot anchoring & latent structure
members_of
How minimal examples disambiguate and recruit latent arithmetic/reasoning interpretations in LLMs
Mechanistic editing through parameter surgical intervention
members_of
Direct modification of model subcomponents (MLPs, embeddings, unembedding vectors) to predictably alter outputs without retraining, using rank-one constraints.

Questions (1)

question

are the resulting parameter subcomponents actually interpretable objects?
answered_by
First question posed after applying VPD, investigating whether the subcomponents make sense.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance. · finding · 0.750
SAE features are not simply mirroring individual neurons.
Emoticon continuation prediction · concept · 0.748
The functional role of a specific VPD subcomponent in predicting emoticon/emoji continuations after punctuation.
Emoticon eye subcomponent · concept · 0.743
The part of the emoticon subcomponent responsible for recognizing the 'eyes' of emoticons like ';', ':' or '=', which was edited in the demo.
Approximately 0.2% of MLP neurons at layer 18 (~28 neurons) are sufficient to account for the generic addition computation across all cyclic tasks · claim · 0.742
Claim about the sparsity and sufficiency of the identified neuron set.
Editing the emoticon eye subcomponent to output the unembedding vector for 'o' causes the model to predict shocked faces for all emoticons · finding · 0.740
Direct parameter subcomponent overwrite produces a clean behavioral change without training.
Higher-density priors (B10) are more robust to fine-tuning than lower-density ones (B9). · claim · 0.737
Interpretation of cross-base transfer asymmetry.
Feature pair A/1/3949 and B/1/3321 have activation correlation 0.98 but negative logit weight correlation, firing on PLOSOne journal abbreviations · finding · 0.737
Demonstrates that activation similarity can diverge from logit weight similarity due to interference.
User message embeddings predict the model's subsequent Assistant Axis projection with R²=0.53-0.77 (p<0.001) but predict the delta from the previous response with only R²=0.10 · finding · 0.736
Shows that model persona position is primarily determined by the most recent user message, not prior drift.
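The related finding above about overwriting the emoticon eye subcomponent with the unembedding vector for 'o' can be illustrated with a toy sketch. This is an assumption-laden illustration, not the source's procedure: the three-token vocabulary, the one-hot `UNEMBED` table, the activation value, and the function `predict` are all hypothetical, and a real model would have many other components contributing to the logits.

```python
# Toy illustration of editing a rank-one subcomponent's write vector.
# Vocabulary, vectors, and activation are invented for this example.

VOCAB = [")", "(", "o"]

# Hypothetical unembedding vectors, one-hot so the effect is easy to see.
UNEMBED = {
    ")": [1.0, 0.0, 0.0],
    "(": [0.0, 1.0, 0.0],
    "o": [0.0, 0.0, 1.0],
}

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def predict(write_vec, activation):
    """Top token when the subcomponent writes activation * write_vec."""
    out = [activation * w for w in write_vec]
    logits = {tok: dot(UNEMBED[tok], out) for tok in VOCAB}
    return max(logits, key=logits.get)

activation = 2.0  # hypothetical: the subcomponent fired after ':' or ';'

# Before the edit: the write vector happens to promote a smiling mouth.
before = predict(UNEMBED[")"], activation)

# After the edit: overwrite the write vector with the 'o' unembedding vector.
after = predict(UNEMBED["o"], activation)
```

In this toy setting the read side of the subcomponent is untouched, so it still fires on the same punctuation; only what it writes changes, which is why no retraining is involved.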