Cosine Similarity-Based Deception Detection

Detection mechanism that computes the cosine similarity between a model's activation vectors and steering vectors, then uses that score to classify outputs as deceptive or not.

Neighborhood — ranked by edge-count

method

Cosine Similarity Binary Classifier
implements
Binary classifier that uses the cosine similarity between activation vectors and steering vectors to detect deception, with a reported accuracy of 89%.
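
A minimal sketch of how such a classifier could work, not the source's actual implementation. It assumes the steering vector points in the "deceptive" direction and that a fixed decision threshold is applied to the similarity score. The function names, the toy 3-dimensional vectors, and the threshold of 0.5 are all hypothetical; the source does not state a decision threshold or which layer the activations are read from.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_deceptive(
    activation: Sequence[float],
    steering_vector: Sequence[float],
    threshold: float = 0.5,  # hypothetical value, not taken from the source
) -> bool:
    """Flag an activation as deceptive when it aligns with the steering vector.

    Assumes the steering vector points toward the "deceptive" direction.
    """
    return cosine_similarity(activation, steering_vector) >= threshold


# Toy vectors for illustration only; real activations have thousands of dimensions.
steering = [1.0, 0.0, 0.0]
aligned = [0.9, 0.1, 0.0]      # nearly parallel to the steering vector
orthogonal = [0.0, 1.0, 0.0]   # no component along the steering vector

print(is_deceptive(aligned, steering))
print(is_deceptive(orthogonal, steering))
```

Because cosine similarity ignores vector magnitude, the score depends only on the direction of the activation, so in practice the threshold would have to be tuned on labeled examples.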

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Cosine similarity between truth probes · method · 0.837
Geometric evaluation of truth direction alignment across layers and prompt templates.

Cosine Similarity Measurement · method · 0.833
Used to measure alignment between DIM direction and cone basis vectors to assess overlap.

Masked Cosine Similarity · method · 0.819
Cosine similarity between feature activations restricted to tokens where one of the features fires; used to identify feature splitting relationships.

Cosine Similarity Ranking for Instruction Discovery · method · 0.791
Method to discover new reflection-inducing instructions by ranking candidate tokens by cosine similarity to steering vectors.

Pairwise Cosine Similarity Analysis · method · 0.774
Used to quantify the semantic clustering of adjective-set embeddings across model families and conditions.

Fact-Based Deception Under Coercive Circumstances · framework · 0.762
First experimental paradigm inducing and detecting verifiable lies under external coercion using threat-based prompts.

Feature neighborhood exploration via cosine similarity of decoder weights · method · 0.761
Identifying related features by cosine distance in SAE decoder space.

Cosine projection on reflection direction · method · 0.736
Feature extraction method computing cosine similarity of hidden representations with reflection direction across all layers.