finding

active

finding:clear-accuracy-stratification-across-three-reflection-levels-on-cruxeval-o-adv-triggered-065-247-intrinsic-040-133-no-reflection-017-051-for-qwen2-5-3b-gemma3-4b-it

Clear accuracy stratification across three reflection levels on cruxeval_o_adv: Triggered (.065/.247) > Intrinsic (.040/.133) > No Reflection (.017/.051) for Qwen2.5-3B/Gemma3-4B-IT

Core empirical result validating the three-level reflection framework on code reasoning.
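The ordering stated in the title can be checked mechanically. The sketch below is illustrative only: it copies the six accuracies from the title and verifies that each model's accuracy rises strictly from No Reflection to Intrinsic to Triggered. The names `ACCURACY`, `LEVELS`, `is_stratified`, and `gain_over_baseline` are hypothetical helpers introduced here, not part of the source paper or any library.

```python
# Accuracies on cruxeval_o_adv, copied from the finding title.
# All identifiers below are hypothetical, chosen for illustration.
ACCURACY = {
    "Qwen2.5-3B": {"no_reflection": 0.017, "intrinsic": 0.040, "triggered": 0.065},
    "Gemma3-4B-IT": {"no_reflection": 0.051, "intrinsic": 0.133, "triggered": 0.247},
}

# Reflection levels ordered from least to most reflection.
LEVELS = ["no_reflection", "intrinsic", "triggered"]


def is_stratified(scores):
    """Return True when accuracy strictly increases from one level to the next."""
    values = [scores[level] for level in LEVELS]
    return all(lower < higher for lower, higher in zip(values, values[1:]))


def gain_over_baseline(scores):
    """Return the ratio of Triggered accuracy to No Reflection accuracy."""
    return scores["triggered"] / scores["no_reflection"]


for model, scores in ACCURACY.items():
    print(model, is_stratified(scores), round(gain_over_baseline(scores), 2))
```

The ratio is only a rough summary of the gap between the highest and lowest levels; absolute accuracies remain low for both models on this benchmark.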

Source paper

extracted_from

Unveiling the Latent Directions of Reflection in Large Language Models

(2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan

Neighborhood — ranked by edge-count

Claims (1)

claim

Reflection is not merely a behavioral artifact of prompting but a phenomenon encoded in the model's activation space.
supports
Central interpretive claim of the paper, supported by steering vector experiments.

Questions (1)

question

Do effective trigger instructions correspond to latent directions in the hidden space that implicitly induce the self-reflection process?
answered_by
Second key research question motivating the latent direction analysis.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Triggered Reflection with 'Alternatively' achieves accuracy .684 on gsm8k_adv for Gemma3-4B-IT · finding · 0.802
Highest single-instruction accuracy result in the paper.
Reflection direction features achieve AUROC 0.772 vs. 0.736 for final layer baseline on deepseek-llama-8b on GSM8k correctness prediction · finding · 0.791
Supports claim that uncertainty is encoded in reflection direction
No Reflection with 'Answer' achieves accuracy .037 on gsm8k_adv for Qwen2.5-3B · finding · 0.777
Baseline accuracy when reflection is suppressed.
Spearman's rank correlation among different alignment metrics (CKA, SVCCA, Mutual k-NN, CKNNA) over 78 vision models is high across variants, with all p-values below 2.24×10^-105 · finding · 0.768
Validates robustness of alignment metric choice
Contrasting No Reflection with Triggered Reflection (µ(0→2)) provides a stronger reflection signal than contrasting Intrinsic with Triggered Reflection (µ(1→2)). · claim · 0.766
Empirical interpretation of which reference baseline yields more useful steering vectors.
Gemma-2-27B Perspectives accuracy remains 100% after SOO fine-tuning · finding · 0.766
SOO fine-tuning did not collapse Gemma-2-27B self-other distinction needed for perspective-taking
Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories · finding · 0.762
Out-of-domain generalization showing deception features track general representational honesty
Thought detection peaks at ~2/3 layer depth; intention checking peaks at ~1/2 layer depth. · finding · 0.755
Lindsey (2026) differential layer performance explained by Janus's path combinatorics: different tasks use different path distributions.