claim

active

claim:contrasting-no-reflection-with-triggered-reflection-0-2-provides-a-stronger-reflection-signal-than-contrasting-intrinsic-with-triggered-reflection-1-2

Contrasting No Reflection with Triggered Reflection (µ(0→2)) provides a stronger reflection signal than contrasting Intrinsic with Triggered Reflection (µ(1→2)).

Empirical interpretation of which reference baseline yields more useful steering vectors.
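The two contrasts named in the claim can be sketched as difference-of-means steering vectors. This is a minimal sketch that assumes µ(i→j) denotes the mean activation at reflection level j minus the mean activation at level i. The helper names (`mean_vector`, `steering_vector`, `norm`) and the toy activations are hypothetical and are not taken from the paper. The vector norm printed at the end only illustrates how a larger contrast can show up in toy data. It is not the paper's evaluation metric, which compares the vectors on instruction discovery.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(source_acts, target_acts):
    """Difference of means: mean(target) - mean(source)."""
    src = mean_vector(source_acts)
    tgt = mean_vector(target_acts)
    return [t - s for s, t in zip(src, tgt)]


def norm(v):
    """Euclidean length of a vector."""
    return sum(x * x for x in v) ** 0.5


# Toy, invented 2-dimensional activations for the three reflection levels.
level0 = [[0.0, 0.0], [0.0, 2.0]]  # No Reflection (assumed label 0)
level1 = [[1.0, 1.0], [1.0, 3.0]]  # Intrinsic Reflection (assumed label 1)
level2 = [[3.0, 4.0], [3.0, 6.0]]  # Triggered Reflection (assumed label 2)

mu_0_to_2 = steering_vector(level0, level2)
mu_1_to_2 = steering_vector(level1, level2)

print(mu_0_to_2, norm(mu_0_to_2))
print(mu_1_to_2, norm(mu_1_to_2))
```

In this toy data the 0→2 contrast is larger by construction. Whether the same holds for a real model depends on the activations collected at each layer.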

Source paper

extracted_from

Unveiling the Latent Directions of Reflection in Large Language Models

(2025) · Chang, Fu-Chieh · Lee, Yu-Ting · Wu, Pei-Yuan

Neighborhood — ranked by edge-count

Findings (1)

finding

Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models
supports
Shows that contrasting No Reflection with Triggered Reflection provides a stronger signal than Intrinsic vs Triggered.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Triggered Reflection with 'Alternatively' achieves accuracy .684 on gsm8k_adv for Gemma3-4B-IT · finding · 0.805
Highest single-instruction accuracy result in the paper.
Reflection is not merely a behavioral artifact of prompting but a phenomenon encoded in the model's activation space. · claim · 0.784
Central interpretive claim of the paper, supported by steering vector experiments.
Enacted reflection may correspond to silent mid-layer processing; described reflection to the motor impulse of concepts leaking through to output. · claim · 0.778
Mechanistic analog connecting Lindsey's layer-localized findings to the scorer's enacted/described distinction
Clear accuracy stratification across three reflection levels on cruxeval_o_adv: Triggered (.065/.247) > Intrinsic (.040/.133) > No Reflection (.017/.051) for Qwen2.5-3B/Gemma3-4B-IT · finding · 0.766
Core empirical result validating the three-level reflection framework on code reasoning.
Reflection-inducing directions emerge more clearly in higher layers (ℓ>5) for both models and datasets · finding · 0.762
Empirical observation about which network layers encode reflection-relevant information.
Higher reflection frequency correlates with lower accuracy partly because more reflections are generated on difficult questions · claim · 0.760
Author's interpretation of the negative correlation between reflection rate and accuracy observed in Fig. 5
Activation steering of reflection has dual-use implications: it can enhance reflection as a defense mechanism, but malicious actors may also use it to inhibit reflection to facilitate jailbreaks. · claim · 0.756
Applied dual-use conclusion drawn from the paper's findings.
µ is an applicative homomorphism: µ(pure a) = pure a and µ(imf <*> imx) = µ imf <*> µ imx. · claim · 0.755
Result for Image applicative specification.