Cosine Similarity Ranking for Instruction Discovery

A method for discovering new reflection-inducing instructions by ranking candidate tokens according to their cosine similarity to steering vectors.
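The ranking step described above can be illustrated with a minimal sketch. It assumes each candidate token has an embedding in the same space as the steering vector; the function names, the toy tokens, and the three-dimensional vectors are hypothetical and are not taken from the source.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_tokens(token_vectors, steering_vector, top_k=3):
    """Return the top_k (token, score) pairs, highest cosine similarity first."""
    scored = [(tok, cosine(vec, steering_vector)) for tok, vec in token_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy data: real embeddings would come from the model under study.
token_vectors = {
    "wait": [1.0, 0.0, 0.0],
    "hmm": [0.8, 0.6, 0.0],
    "the": [0.0, 1.0, 0.0],
    "banana": [-1.0, 0.0, 0.0],
}
steering_vector = [2.0, 0.0, 0.0]  # assumed reflection direction, illustrative only

ranking = rank_tokens(token_vectors, steering_vector, top_k=3)
```

Because cosine similarity ignores vector length, the scale of the steering vector does not affect the ranking. Only the direction matters.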

Neighborhood — ranked by edge-count

paper

Input Embedding Similarity Baseline · method · associated_with
Baseline method for instruction discovery using surface-level input embedding similarity instead of steering vectors.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Cosine Similarity Measurement · method · 0.815
Used to measure alignment between DIM direction and cone basis vectors to assess overlap.

Cosine Similarity Binary Classifier · method · 0.807
Classifier using cosine similarity between activation vectors and steering vectors to detect deception with 89% accuracy.

Pairwise Cosine Similarity Analysis · method · 0.803
Used to quantify the semantic clustering of adjective-set embeddings across model families and conditions.

Feature neighborhood exploration via cosine similarity of decoder weights · method · 0.791
Identifying related features by cosine distance in SAE decoder space.

Cosine Similarity-Based Deception Detection · concept · 0.791
Detection mechanism computing cosine similarity between activation vectors and steering vectors to classify deception.

Cosine similarity between truth probes · method · 0.780
Geometric evaluation of truth direction alignment across layers and prompt templates.

Steering vector-based instruction discovery outperforms input embedding similarity baseline for reflection-inducing instruction selection · finding · 0.773
Demonstrates that surface-level embedding similarity fails to capture reflective semantics.

Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity. · finding · 0.771
Shows the passive vs. active divide is more important than the specific wording of instructions.
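
The "cosine ≥ 0.65 · no typed edge" listing above could be produced roughly as follows. This is a sketch under stated assumptions, not the page's actual implementation: it assumes every entity has an embedding vector and that typed relations are available as a set of entity names. All identifiers and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def untyped_neighbors(query_vec, entity_vecs, typed_edges, threshold=0.65):
    """List (name, score) for entities at or above the threshold with no typed edge."""
    hits = []
    for name, vec in entity_vecs.items():
        if name in typed_edges:
            continue  # already linked by a typed relation, so not a candidate
        score = cosine(query_vec, vec)
        if score >= threshold:
            hits.append((name, score))
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits

# Hypothetical toy data, two-dimensional for readability.
query_vec = [1.0, 0.0]
entity_vecs = {
    "A": [1.0, 0.0],   # very similar, but already has a typed edge
    "B": [0.8, 0.6],   # cosine 0.8, kept
    "C": [0.6, 0.8],   # cosine 0.6, below the threshold
    "D": [0.9, 0.1],   # cosine about 0.994, kept
}
typed_edges = {"A"}

candidates = untyped_neighbors(query_vec, entity_vecs, typed_edges)
```

Entities returned this way are the candidates for new edges or unrecognized duplicates that the listing describes.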