Cosine similarity between truth probes

Geometric evaluation of truth direction alignment across layers and prompt templates.
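
As a minimal sketch of the comparison described above, assuming each truth probe is a linear probe whose weight vector serves as its "truth direction", alignment can be measured as the pairwise cosine similarity between those vectors. The names `cosine_similarity`, `pairwise_cosine`, and `probe_directions` are illustrative, and the vectors are toy values, not results from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def pairwise_cosine(directions):
    """Square matrix of cosine similarities between all probe directions."""
    return [[cosine_similarity(u, v) for v in directions] for u in directions]

# Hypothetical probe weight vectors, one per layer or prompt template
# (toy values chosen for illustration only).
probe_directions = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],
]
similarity_matrix = pairwise_cosine(probe_directions)
```

Entry `similarity_matrix[i][j]` compares probe `i` with probe `j`. Because cosine similarity ignores vector length, the second and third toy vectors count as perfectly aligned even though their norms differ; plotting the matrix gives the kind of heatmap referred to in the related claim further down the page.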

Neighborhood — ranked by edge-count

Papers (1)

Testing the Limits of Truth Directions in LLMs · paper · uses

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Cosine Similarity-Based Deception Detection · concept · 0.837
Detection mechanism computing cosine similarity between activation vectors and steering vectors to classify deception.

Cosine Similarity Measurement · method · 0.831
Used to measure alignment between DIM direction and cone basis vectors to assess overlap.

Masked Cosine Similarity · method · 0.797
Cosine similarity between feature activations restricted to tokens where one of the features fires; used to identify feature splitting relationships.

The model converges to a more stable truth direction in middle-to-late layers, as evidenced by increasing cosine similarity between layer-wise probes. · claim · 0.781
Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.

Pairwise Cosine Similarity Analysis · method · 0.781
Used to quantify the semantic clustering of adjective-set embeddings across model families and conditions.

Cosine Similarity Ranking for Instruction Discovery · method · 0.780
Method to discover new reflection-inducing instructions by ranking candidate tokens by cosine similarity to steering vectors.

In Qwen-2.5-9B, only v1 has meaningful cosine similarity to DIM direction; all additional basis vectors have cosine similarities ~1e-9 · finding · 0.775
Appendix E replication of DIM alignment finding in Qwen model.

Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity. · finding · 0.774
Shows the passive vs. active divide is more important than the specific wording of instructions.
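
The Masked Cosine Similarity entry above can be illustrated with a small sketch. The entry does not say which feature defines the mask, so this version assumes the mask is the set of tokens where the first feature's activation exceeds a threshold. The function name, the `threshold` parameter, and the choice to return 0.0 when a masked vector is all zeros are assumptions made for illustration, not details taken from the source.

```python
import math

def masked_cosine_similarity(acts_a, acts_b, threshold=0.0):
    """Cosine similarity of two per-token activation sequences,
    restricted to tokens where feature A fires (activation > threshold).

    Assumption: the mask comes from feature A only. Returns 0.0 when the
    mask is empty or either masked vector has zero norm.
    """
    if len(acts_a) != len(acts_b):
        raise ValueError("activation sequences must cover the same tokens")
    pairs = [(a, b) for a, b in zip(acts_a, acts_b) if a > threshold]
    dot = sum(a * b for a, b in pairs)
    norm_a = math.sqrt(sum(a * a for a, _ in pairs))
    norm_b = math.sqrt(sum(b * b for _, b in pairs))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy activations over four tokens (illustrative values only).
feature_a = [0.0, 2.0, 0.0, 1.0]
feature_b = [5.0, 4.0, 3.0, 2.0]
masked_score = masked_cosine_similarity(feature_a, feature_b)
```

In this toy case feature B is proportional to feature A on the two tokens where A fires, so the masked score is close to 1 even though B is also active elsewhere. That asymmetry is what makes the measure a plausible signal for feature splitting, where a narrower feature tracks a broader one on a subset of tokens.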