finding

active

Top-5 instructions by µ(1→2) at ℓ=12 achieve average cosine similarity .9893 and average accuracy .5645 on gsm8k_adv for Gemma3-4B-IT

The high cosine similarity for Gemma3 steering vectors suggests that reflection is encoded along a strongly linear direction.
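As a rough illustration of what "top-5 instructions by µ(1→2)" could mean in practice, the sketch below ranks candidate instructions by the cosine similarity between a per-instruction vector and a steering vector, then averages the similarity over the top k. It is not the paper's code: the function names, the toy random vectors, and the assumption that each instruction is represented by a single vector at layer ℓ are all hypothetical.

```python
import numpy as np


def cosine_similarity(vectors, direction):
    """Cosine similarity between each row of `vectors` and `direction`."""
    vectors = np.atleast_2d(np.asarray(vectors, dtype=float))
    direction = np.asarray(direction, dtype=float)
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(direction)
    return (vectors @ direction) / norms


def top_k_by_steering(instructions, instruction_vectors, steering_vector, k=5):
    """Return the k instructions most aligned with the steering vector,
    together with the mean cosine similarity over those k."""
    sims = cosine_similarity(instruction_vectors, steering_vector)
    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    top = [(instructions[i], float(sims[i])) for i in order]
    return top, float(np.mean(sims[order]))


# Toy data (assumption): random vectors stand in for layer-12 activations.
rng = np.random.default_rng(0)
steering_vector = rng.normal(size=16)
instructions = [f"instruction_{i}" for i in range(20)]
instruction_vectors = steering_vector + 0.5 * rng.normal(size=(20, 16))

top, avg = top_k_by_steering(instructions, instruction_vectors, steering_vector, k=5)
for name, sim in top:
    print(f"{name}: {sim:.4f}")
print(f"average cosine similarity over top-5: {avg:.4f}")
```

In a real setting the instruction vectors would come from the model's hidden states at the chosen layer, and the accuracy figure would be measured separately by running each selected instruction on the benchmark.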

Source paper

extracted_from

Unveiling the Latent Directions of Reflection in Large Language Models

(2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan

Neighborhood — ranked by edge-count

Claims (1)

claim

Steering vectors enable systematic discovery of reflection-inducing instructions beyond trial-and-error prompt design.
supports
Core applied contribution claim, supported by top-k accuracy comparisons.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Base and instruct Gemma 2 27B role PCs have cosine similarities of 0.93, 0.87, 0.83 for the top 3 PCs respectively; role vector cosine similarities >0.99 for every role pair · finding · 0.812
Shows persona space axes are inherited from pre-training, not solely created by post-training
Triggered Reflection with 'Alternatively' achieves accuracy .684 on gsm8k_adv for Gemma3-4B-IT · finding · 0.792
Highest single-instruction accuracy result in the paper.
Experimental condition adjective embeddings show mean cosine similarity 0.657 (n=9,591 pairs), significantly higher than history (0.628, t=15.8, p=1.4×10⁻⁵⁵), conceptual (0.587, t=38.5, p<10⁻³⁰⁰), and zero-shot (0.603, t=35.1, p=4.3×10⁻²⁶²) · finding · 0.784
Core result of Experiment 3: cross-model semantic convergence under self-referential processing
QwQ-32B accuracy on GSM8k remains between 96.36% and 96.50% across all intervention strengths (-0.96 to +0.48) · finding · 0.781
Demonstrates that stronger models are largely insensitive to reflection manipulation
In Gemma-2-9B, only the first cone axis (v1) has non-negligible cosine similarity to the DIM direction; all other axes have near-zero similarity (~1e-9) · finding · 0.781
Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
In Qwen-2.5-9B, only v1 has meaningful cosine similarity to DIM direction; all additional basis vectors have cosine similarities ~1e-9 · finding · 0.779
Appendix E replication of DIM alignment finding in Qwen model
No Reflection with 'Answer' achieves accuracy .037 on gsm8k_adv for Qwen2.5-3B · finding · 0.775
Baseline accuracy when reflection is suppressed.
Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models · finding · 0.773
Shows that contrasting No Reflection with Triggered Reflection provides a stronger signal than Intrinsic vs Triggered.