concept
active
concept:cosine-similarity-based-deception-detection
Cosine Similarity-Based Deception Detection
Detection mechanism computing cosine similarity between activation vectors and steering vectors to classify deception
Neighborhood — ranked by edge-count
Methods (1)
method
- Cosine Similarity Binary Classifier · implements · Classifier using cosine similarity between activation vectors and steering vectors to detect deception with 89% accuracy
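A minimal sketch of how such a classifier might work, assuming one activation vector per example, a single precomputed steering vector, and a fixed decision threshold. The function names, the threshold value of 0.0, and the convention that higher similarity means "deceptive" are illustrative assumptions, not details taken from the source; the 89% accuracy figure is not reproduced here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_deceptive(
    activation: Sequence[float],
    steering_vector: Sequence[float],
    threshold: float = 0.0,  # hypothetical cut-off; would be tuned on labeled data
) -> bool:
    """Flag an activation as deceptive when it aligns with the steering vector.

    Assumes the steering vector points toward the "deceptive" direction.
    """
    return cosine_similarity(activation, steering_vector) > threshold


# Toy 3-dimensional example (real activations have thousands of dimensions).
steering = [1.0, 0.0, 0.0]
aligned = [2.0, 0.5, 0.0]    # points mostly along the steering direction
opposed = [-1.0, 0.2, 0.0]   # points mostly against it

print(is_deceptive(aligned, steering))
print(is_deceptive(opposed, steering))
```

Because cosine similarity ignores vector magnitude, a classifier of this shape depends only on the direction of the activation, so the threshold is the single parameter to calibrate.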
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Geometric evaluation of truth direction alignment across layers and prompt templates.
- Used to measure alignment between DIM direction and cone basis vectors to assess overlap
- Cosine similarity between feature activations restricted to tokens where one of the features fires; used to identify feature splitting relationships
- Method to discover new reflection-inducing instructions by ranking candidate tokens by cosine similarity to steering vectors.
- Used to quantify the semantic clustering of adjective-set embeddings across model families and conditions
- First experimental paradigm inducing and detecting verifiable lies under external coercion using threat-based prompts
- Identifying related features by cosine distance in SAE decoder space.
- Feature extraction method computing cosine similarity of hidden representations with reflection direction across all layers