method
active
method:feature-neighborhood-exploration-via-cosine-similarity-of-decoder-weights
Feature neighborhood exploration via cosine similarity of decoder weights
Identifying related features by the cosine similarity of their SAE decoder weight vectors.
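As a rough illustration of the idea, the sketch below ranks the other features in a toy decoder by cosine similarity to a chosen query feature. It assumes each feature's decoder direction is available as one row of a matrix. The names `cosine_similarity`, `feature_neighbors`, and `toy_decoder`, and the `threshold` value of 0.5, are hypothetical choices made for this example and are not taken from the source.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def feature_neighbors(decoder_rows, query_index, threshold=0.5):
    """Return (feature_index, similarity) pairs at or above threshold, best first.

    decoder_rows: list of decoder weight vectors, one per SAE feature (assumed layout).
    threshold: arbitrary illustrative cutoff, not a value stated by the source.
    """
    query = decoder_rows[query_index]
    scored = [
        (i, cosine_similarity(query, row))
        for i, row in enumerate(decoder_rows)
        if i != query_index  # skip the query feature itself
    ]
    kept = [(i, s) for i, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy 4-feature decoder in a 3-dimensional space (made-up numbers).
toy_decoder = [
    [1.0, 0.0, 0.0],  # query feature
    [0.9, 0.1, 0.0],  # nearly parallel to the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.7, 0.7, 0.0],  # 45 degrees from the query
]

neighbors = feature_neighbors(toy_decoder, query_index=0)
for index, similarity in neighbors:
    print(f"feature {index}: cosine similarity {similarity:.3f}")
```

With a real SAE the same ranking would normally be done as one matrix product over row-normalized decoder weights; the pure-Python loop here only keeps the example dependency-free.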
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Classifier using cosine similarity between activation vectors and steering vectors to detect deception with 89% accuracy
- Method to discover new reflection-inducing instructions by ranking candidate tokens by cosine similarity to steering vectors.
- Detection mechanism computing cosine similarity between activation vectors and steering vectors to classify deception
- Geometric evaluation of truth direction alignment across layers and prompt templates.
- Computing attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by feature activation.
- Used to quantify the semantic clustering of adjective-set embeddings across model families and conditions
- Mechanistic evidence that network actively attenuates injected perturbations, explaining late-layer introspection failure
- Second of three speculative claims asserting that subgraphs of neural networks are tractable and meaningful objects of study