B9 final accuracy 89.7 ± 2.1%

Accuracy at k=16 shots for B9.

Source paper

extracted_from

(2025) · Edward Yi Chang · Zeyneb N. Kaya · Ethan Chang

community

Chain-of-Thought reasoning robustness & safety
members_of
CoT effects on generalization, multimodal QA accuracy, and AI safety alignment training.
Multimodal chain-of-thought reasoning benchmarks
members_of
ScienceQA and related vision-language tasks evaluated via explicit reasoning steps, spanning 738M-parameter models with 89-95% accuracy ranges.
Benchmark classification accuracy results
members_of
Three benchmarks (B8, B9, B10) with mean accuracy and standard deviation metrics.

question

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

B8 final accuracy 92.4 ± 1.8% · finding · 0.875
Accuracy at k=16 shots for B8.
B10 final accuracy 94.8 ± 1.2% · finding · 0.867
Accuracy at k=16 shots for B10.
Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied · finding · 0.765
The misleadingly high result that the prior paradigm would report as evidence of introspection
B9 phase width (k90 − k10) = 3.74 ± 0.31 shots · finding · 0.741
Widest transition in E2; consistent with lower prior density requiring more shots for reliable threshold crossing
No Reflection with 'Answer' achieves accuracy 0.037 on gsm8k_adv for Qwen2.5-3B · finding · 0.739
Baseline accuracy when reflection is suppressed.
QwQ-32B accuracy on GSM8k remains between 96.36% and 96.50% across all intervention strengths (-0.96 to +0.48) · finding · 0.731
Demonstrates that stronger models are largely insensitive to reflection manipulation
90.45% accuracy on ScienceQA benchmark with Multimodal-CoT Large (738M parameters) · finding · 0.731
State-of-the-art result on ScienceQA; a 3.91 percentage-point improvement over the prior best published result of 86.54%.
Binary detection accuracy (up to 97.3% at L0 α=5) is entirely explained by global logit shifts (r=0.999 correlation with control) · finding · 0.730
Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
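
The neighbor list above is selected by the rule "cosine ≥ 0.65 · no typed edge". The snippet below is a minimal sketch of how such a filter could work, not the graph's actual implementation. The function names, the `embeddings` dictionary, and the `typed_edges` set of unordered ID pairs are all hypothetical; only the 0.65 threshold comes from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    embeddings  -- hypothetical dict mapping entity ID to embedding vector
    typed_edges -- hypothetical set of frozenset({id_a, id_b}) pairs that
                   already have a typed relation (e.g. extracted_from)
    threshold   -- minimum cosine similarity; 0.65 is the value shown on the page
    """
    target_vec = embeddings[target_id]
    hits = []
    for other_id, other_vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked to the target by a typed edge.
        if frozenset((target_id, other_id)) in typed_edges:
            continue
        score = cosine(target_vec, other_vec)
        if score >= threshold:
            hits.append((other_id, score))
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits
```

Under these assumptions, the source paper and the communities would be excluded because they are already connected by `extracted_from` and `members_of` edges, while entities such as the B8 and B10 findings would be listed as candidates for new edges or duplicates.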