method

active

method:feature-neighborhood-exploration-via-cosine-similarity-of-decoder-weights

Feature neighborhood exploration via cosine similarity of decoder weights

Identifying related features by the cosine similarity of their decoder weight vectors in a sparse autoencoder (SAE).
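A minimal sketch of the neighborhood lookup, under assumptions not stated on this page: the decoder is available as a NumPy array `W_dec` of shape `(n_features, d_model)` with one decoder direction per row, and `decoder_neighbors` is a hypothetical helper name. Real SAE implementations may store the decoder transposed or as a framework tensor.

```python
import numpy as np

def decoder_neighbors(W_dec, feature_idx, top_k=5):
    """Return (index, cosine) pairs for the features closest to feature_idx.

    Assumes W_dec has shape (n_features, d_model), one decoder row per feature.
    """
    W_dec = np.asarray(W_dec, dtype=float)
    # Normalize rows to unit length; the floor avoids division by zero.
    norms = np.linalg.norm(W_dec, axis=1, keepdims=True)
    unit = W_dec / np.maximum(norms, 1e-12)
    # Dot products of unit vectors are cosine similarities.
    sims = unit @ unit[feature_idx]
    sims[feature_idx] = -np.inf  # exclude the query feature itself
    top_k = min(top_k, W_dec.shape[0] - 1)
    order = np.argsort(-sims)[:top_k]
    return [(int(i), float(sims[i])) for i in order]
```

Normalizing once and using a single matrix-vector product keeps the lookup cheap even for large dictionaries; for all-pairs neighborhoods the same idea extends to `unit @ unit.T`, at quadratic memory cost.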

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Cosine Similarity Binary Classifier · method · 0.793
Classifier using cosine similarity between activation vectors and steering vectors to detect deception with 89% accuracy.
Cosine Similarity Ranking for Instruction Discovery · method · 0.791
Method to discover new reflection-inducing instructions by ranking candidate tokens by cosine similarity to steering vectors.
Cosine Similarity-Based Deception Detection · concept · 0.761
Detection mechanism computing cosine similarity between activation vectors and steering vectors to classify deception.
Cosine similarity between truth probes · method · 0.751
Geometric evaluation of truth direction alignment across layers and prompt templates.
Feature attribution via gradient dot product with SAE decoder · method · 0.751
Computing attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by feature activation.
Pairwise Cosine Similarity Analysis · method · 0.746
Used to quantify the semantic clustering of adjective-set embeddings across model families and conditions.
Cosine similarity between perturbed and baseline residual streams returns toward 1.0 and projection onto injection direction decays exponentially over subsequent layers · finding · 0.745
Mechanistic evidence that network actively attenuates injected perturbations, explaining late-layer introspection failure.
Features are connected by weights forming circuits, and these circuits can be rigorously studied and understood as meaningful algorithms. · claim · 0.744
Second of three speculative claims asserting that subgraphs of neural networks are tractable and meaningful objects of study.
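
The selection rule for this list (cosine ≥ 0.65, no typed edge) could be sketched as follows. This is an illustration only: the function name, the `scores` mapping, and the representation of typed edges as unordered ID pairs are all assumptions, not the page's actual implementation.

```python
def similarity_candidates(source_id, scores, typed_edges, threshold=0.65):
    """List entities similar to source_id that have no typed edge to it.

    scores: dict mapping entity ID -> cosine similarity to source_id (assumed).
    typed_edges: set of frozenset({id_a, id_b}) pairs (assumed, undirected).
    """
    candidates = [
        (entity_id, score)
        for entity_id, score in scores.items()
        if entity_id != source_id
        and score >= threshold
        and frozenset((source_id, entity_id)) not in typed_edges
    ]
    # Highest similarity first, matching the ordering of the list above.
    return sorted(candidates, key=lambda pair: -pair[1])
```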