method
active
method:cosine-similarity-ranking-for-instruction-discovery
Cosine Similarity Ranking for Instruction Discovery
Method for discovering new reflection-inducing instructions by ranking candidate tokens according to their cosine similarity to steering vectors.
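The ranking step can be illustrated with a minimal sketch. It assumes each candidate token already has a vector in the same space as the steering vector; the entry does not say how those vectors are obtained, so that part is left out. The names `cosine_similarity`, `rank_candidates`, `tok_a`, `tok_b`, `tok_c`, and the toy 3-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption: treat a zero vector as having no similarity to anything.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(candidate_vectors, steering_vector, top_k=None):
    """Sort candidate tokens by cosine similarity to the steering vector.

    candidate_vectors: dict mapping token string -> list of floats.
    Returns a list of (token, score) pairs, highest score first.
    """
    scored = [
        (token, cosine_similarity(vec, steering_vector))
        for token, vec in candidate_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy data (hypothetical): one steering vector, three candidate tokens.
steering = [1.0, 0.0, 0.0]
candidates = {
    "tok_a": [0.9, 0.1, 0.0],  # nearly parallel to the steering vector
    "tok_b": [0.0, 1.0, 0.0],  # orthogonal to the steering vector
    "tok_c": [0.5, 0.5, 0.0],  # 45 degrees from the steering vector
}

ranking = rank_candidates(candidates, steering, top_k=2)
for token, score in ranking:
    print(f"{token}\t{score:.3f}")
```

With several steering vectors, one option is to rank against each vector separately or against their mean; which of these the method uses is not stated here.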
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (1)
method
- Input Embedding Similarity Baseline (associated_with): Baseline method for instruction discovery using surface-level input embedding similarity instead of steering vectors.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Used to measure alignment between DIM direction and cone basis vectors to assess overlap
- Classifier using cosine similarity between activation vectors and steering vectors to detect deception with 89% accuracy
- Used to quantify the semantic clustering of adjective-set embeddings across model families and conditions
- Identifying related features by cosine distance in SAE decoder space.
- Detection mechanism computing cosine similarity between activation vectors and steering vectors to classify deception
- Geometric evaluation of truth direction alignment across layers and prompt templates.
- Demonstrates that surface-level embedding similarity fails to capture reflective semantics.
- Shows the passive vs. active divide is more important than the specific wording of instructions.