Pei-Yuan Wu

Senior co-author, affiliated with NTU and AINTU.

Authored: 1
Introduces: 0
Studies: 0
Affiliations: 2
Cited by: 0

Authored papers (1)

  • Argues that reflection in LLMs corresponds to a recoverable latent direction in activation space rather than a behavioral artifact of prompt engineering. Working with Qwen2.5-3B and Gemma3-4B-IT on the adversarial benchmarks gsm8k_adv and cruxeval_o_adv, the paper reports a three-way stratification in exact-match accuracy: triggered reflection (average 0.397 / 0.586 on gsm8k_adv for the two models) substantially outperforms intrinsic reflection (0.295 / 0.335), which in turn outperforms the no-reflection condition (0.051 / 0.147). The central method is contrastive activation steering: steering vectors are computed as mean activation differences at the appended-instruction token position across pairs of reflection levels (e.g., µ⁰→² from No Reflection to Triggered Reflection), then used either to rank candidate trigger tokens by cosine similarity or to intervene directly in residual-stream activations at a chosen layer ℓ. Using layer-12 steering vectors, the paper identifies previously unreported tokens such as "However," "Oops," and "Validate" as effective reflection triggers, with cosine similarities of up to 0.978 against the canonical steering direction; these tokens achieve accuracy competitive with established triggers like "Wait" and "Alternatively." Inhibition interventions produce larger accuracy drops than enhancement interventions produce accuracy gains, meaning that suppressing reflection is mechanistically easier than inducing it. The paper argues that this asymmetry implies a concrete adversarial risk: jailbreak-style prefix attacks that append high-certainty continuations exploit it to disable internal safety-checking, so future defenses should target the identified reflection direction directly.
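
The steering procedure summarized above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It runs on synthetic activations and assumes that activations at the instruction-token position have already been extracted as arrays of shape (examples, hidden size). The function names, the unit-length normalization, the scalar strength `alpha`, and the placeholder tokens `tok_a` and `tok_b` are all hypothetical choices made for illustration.

```python
import numpy as np


def steering_vector(acts_source, acts_target):
    """Mean activation difference between two reflection levels (target minus source)."""
    return acts_target.mean(axis=0) - acts_source.mean(axis=0)


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank_tokens(candidates, direction):
    """Sort candidate tokens by cosine similarity to the steering direction, highest first."""
    scored = [(tok, cosine(vec, direction)) for tok, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def steer(hidden, direction, alpha):
    """Shift a hidden state along the unit steering direction.

    alpha > 0 enhances the direction, alpha < 0 inhibits it.
    Normalizing the direction is an assumption of this sketch.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit


# Synthetic stand-ins for layer activations (assumption: 32 examples, hidden size 16).
rng = np.random.default_rng(0)
d = 16
true_dir = np.ones(d)
acts_none = rng.normal(size=(32, d))                                   # "no reflection"
acts_trig = acts_none + true_dir + 0.1 * rng.normal(size=(32, d))      # "triggered"

mu = steering_vector(acts_none, acts_trig)

# Hypothetical candidate-token vectors: one nearly aligned, one orthogonal to mu.
ortho = rng.normal(size=d)
ortho -= (ortho @ mu) / (mu @ mu) * mu
candidates = {"tok_a": mu + 0.05 * rng.normal(size=d), "tok_b": ortho}
ranked = rank_tokens(candidates, mu)

# Intervene on a single hidden state: enhancement and inhibition.
h = rng.normal(size=d)
h_enhanced = steer(h, mu, alpha=2.0)
h_inhibited = steer(h, mu, alpha=-2.0)
```

In a real model the `steer` step would be applied to residual-stream activations at the chosen layer during the forward pass, for example through a forward hook. That wiring depends on the model library and is left out here.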


Recent mentions (1)