finding

active

finding:editing-the-emoticon-eye-subcomponent-to-output-the-unembedding-vector-for-o-causes-the-model-to-predict-shocked-faces-for-all-emoticons

Editing the emoticon eye subcomponent to output the unembedding vector for 'o' causes the model to predict shocked faces for all emoticons

Directly overwriting a parameter subcomponent produces a clean behavioral change without any training.
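The edit described here can be illustrated with a toy sketch. It assumes the subcomponent is a rank-one piece of an MLP down-projection (an outer product of a write direction and a read direction) and that the edit replaces the write direction with the unembedding vector for 'o'. All names, shapes, the token id, and the unit-norm unembedding are hypothetical choices for illustration, not details taken from the source paper; a final layer norm is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp, vocab = 16, 32, 8
O_TOKEN = 3  # hypothetical token id standing in for 'o'

# Toy unembedding matrix; column t is the direction that raises the logit of token t.
# Assumption: columns are unit-normalised so logit comparisons reduce to cosines.
W_U = rng.normal(size=(d_model, vocab))
W_U /= np.linalg.norm(W_U, axis=0, keepdims=True)

# Rank-one subcomponent: W_sub = outer(write_dir, read_dir).
read_dir = rng.normal(size=d_mlp)     # what the subcomponent detects (e.g. emoticon eyes)
write_dir = rng.normal(size=d_model)  # what it originally writes to the residual stream


def edit_write_direction(unembed, token_id, scale=1.0):
    """Return a write direction equal to the scaled unembedding vector of token_id."""
    return scale * unembed[:, token_id]


def subcomponent_output(write, read, h):
    """Output of the rank-one subcomponent on MLP activations h."""
    return write * (read @ h)


# An activation on which the subcomponent fires positively.
h = read_dir / np.linalg.norm(read_dir)

# The edit: overwrite only the write direction; the read direction is untouched,
# so the subcomponent still fires on the same inputs. No gradient step is taken.
edited_write = edit_write_direction(W_U, O_TOKEN, scale=10.0)
W_sub_edited = np.outer(edited_write, read_dir)

# Direct contribution of the edited subcomponent to the logits.
logit_delta = subcomponent_output(edited_write, read_dir, h) @ W_U
predicted = int(np.argmax(logit_delta))
```

In this toy setting the edited subcomponent pushes the 'o' logit hardest whenever it fires, which mirrors the reported effect of every emoticon being completed as a shocked face. Whether the real model behaves this cleanly depends on how much other components contribute to the same logits.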

Source paper

extracted_from

cimcWhitepaper

Neighborhood — ranked by edge-count

Claims (1)

claim

The ability to make precise edits demonstrates that VPD identifies real computational machinery
supports
Claim that editing success validates VPD's decomposition.

Communities (3)

community

Few-shot anchoring & latent structure
members_of
How minimal examples disambiguate and recruit latent arithmetic/reasoning interpretations in LLMs
Mechanistic editing through surgical parameter intervention
members_of
Direct modification of model subcomponents (MLPs, embeddings, unembedding vectors) to predictably alter outputs without retraining, using rank-one constraints.
Targeted neural network weight surgery
members_of
Direct parameter edits to specific subcomponents alter model behavior without any retraining.

Questions (1)

question

Can we use these parameter subcomponents to perform clean, targeted changes?
answered_by
Implicit question driving the editing experiment.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Direct model editing via parameter subcomponent modification: emoticon eye recognition altered to predict shocked faces with no retraining · finding · 0.867
Demonstrated that VPD-discovered subcomponents encode true computational machinery by enabling targeted, predictable behavior changes without gradient-based training.
Emoticon eye subcomponent · concept · 0.811
The part of the emoticon subcomponent responsible for recognizing the 'eyes' of emoticons like ';', ':' or '=', which was edited in the demo.
CLIP models exhibit higher language-vision alignment than supervised or self-supervised vision models, but this alignment decreases after fine-tuning on ImageNet classification · finding · 0.742
CLIP training paradigm finding in cross-modal alignment
Subcomponent L2.MLP.down:3382 (density 0.00%) predicts emoticon continuations after colon, semicolon, or equals · finding · 0.740
Specific discovered subcomponent that activates on punctuation like ' :', ' ;', ' =', ':-' and predicts the rest of emoticons/emojis.
Editing NLA explanations to change 'reward' to 'penalty' produces steering vector that increases odd-number responses from near-zero to >70%, demonstrating belief capture upstream of behavior · finding · 0.727
Shows NLA explanations capture latent model beliefs about rewards before output selection; validates interpretability.
Emoticon continuation prediction · concept · 0.726
The functional role of a specific VPD subcomponent in predicting emoticon/emoji continuations after punctuation.
AE-2: Embodiment: Modeling output-input contingencies and using the model in perception or control · concept · 0.724
Indicator of embodiment requiring forward models used for perception or motor control.
One-layer model attention heads encode Python-specific skip-trigrams including indentation-based elif/else prediction and function signature patterns · finding · 0.724
Concrete example from examining expanded QK/OV matrices showing how specific programming language structure is encoded in attention weights