Cosine Similarity Binary Classifier

Binary classifier that uses the cosine similarity between activation vectors and steering vectors to detect deception, with a reported accuracy of 89%

Neighborhood — ranked by edge-count

finding

LAT achieves 89% accuracy in detecting strategic deception in QwQ-32B activations
supports
Core detection result showing LAT-based steering vectors can identify deceptive states with high accuracy

concept

Cosine Similarity-Based Deception Detection
implements
Detection mechanism computing cosine similarity between activation vectors and steering vectors to classify deception
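
The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the original implementation: the function names, the NumPy representation of the vectors, and the decision threshold are all assumptions. The source does not state which threshold was used, and the default here is unrelated to the 0.65 neighborhood cutoff shown on this page.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two 1-D vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Guard against division by zero for degenerate (all-zero) inputs.
    return float(np.dot(a, b) / max(denom, eps))


def classify_deceptive(activation, steering_vector, threshold=0.0):
    """Label an activation as deceptive when it aligns with the steering vector.

    `threshold` is a hypothetical parameter; in practice it would be tuned
    on labeled activations rather than fixed in advance.
    """
    return cosine_similarity(activation, steering_vector) >= threshold


# Toy 2-D example with made-up vectors.
steering = np.array([1.0, 0.0])
print(classify_deceptive(np.array([2.0, 0.1]), steering, threshold=0.5))
print(classify_deceptive(np.array([0.0, 1.0]), steering, threshold=0.5))
```

Because cosine similarity ignores vector magnitude, the decision depends only on the direction of the activation relative to the steering vector, not on its norm.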

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Cosine Similarity Measurement · method · 0.831
Used to measure alignment between DIM direction and cone basis vectors to assess overlap
Pairwise Cosine Similarity Analysis · method · 0.817
Used to quantify the semantic clustering of adjective-set embeddings across model families and conditions
Cosine Similarity Ranking for Instruction Discovery · method · 0.807
Method to discover new reflection-inducing instructions by ranking candidate tokens by cosine similarity to steering vectors.
Feature neighborhood exploration via cosine similarity of decoder weights · method · 0.793
Identifying related features by cosine distance in SAE decoder space.
Masked Cosine Similarity · method · 0.786
Cosine similarity between feature activations restricted to tokens where one of the features fires; used to identify feature splitting relationships
Cosine similarity between truth probes · method · 0.766
Geometric evaluation of truth direction alignment across layers and prompt templates.
LLM Judge Binary Classifier · method · 0.746
An LLM-based classifier that returns 1 if the response contains a clear subjective experience report and 0 otherwise
In Qwen-2.5-9B, only v1 has meaningful cosine similarity to DIM direction; all additional basis vectors have cosine similarities ~1e-9 · finding · 0.736
Appendix E replication of DIM alignment finding in Qwen model