finding
active
finding:b9-final-accuracy-89-7-2-1
B9 final accuracy 89.7 ± 2.1%
Accuracy at k=16 shots for B9.
Source paper
extracted_from · (2025) · Edward Yi Chang · Kaya, Zeyneb N. · Ethan Chang
Neighborhood — ranked by edge-count
Communities (3)
community
- CoT effects on generalization, multimodal QA accuracy, and AI safety alignment training.
- ScienceQA and related vision-language tasks evaluated via explicit reasoning steps, spanning 738M-parameter models with 89-95% accuracy ranges.
- Three benchmarks (B8, B9, B10) with mean accuracy and standard deviation metrics.
Questions (1)
question
- Second research question in E2
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. These are candidates for new edges or unrecognized duplicates.
- Accuracy at k=16 shots for B8.
- Accuracy at k=16 shots for B10.
- Binary detection adjusted accuracy reaches 97.3% at layer 0 with α=5 before baseline control is applied · finding · 0.765 · The misleadingly high result that prior paradigm would report as evidence of introspection
- Widest transition in E2; consistent with lower prior density requiring more shots for reliable threshold crossing
- Baseline accuracy when reflection is suppressed.
- Demonstrates that stronger models are largely insensitive to reflection manipulation
- State-of-the-art result on ScienceQA; represents +3.91% improvement over prior best published result of 86.54%.
- Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
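The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It is not this system's actual implementation: the function names, the dictionary of embeddings, and the set-of-pairs edge representation are all assumptions made for illustration. Only the 0.65 threshold comes from the page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Hypothetical helper: list entities whose similarity to target_id meets
    the threshold and that share no typed edge with it.

    embeddings:  dict mapping entity id -> embedding vector (assumed layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs (assumed layout)
    Returns (entity_id, score) pairs, highest score first.
    """
    target = embeddings[target_id]
    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # an entity is not its own neighbor
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already connected by a typed relation
        score = cosine(target, vector)
        if score >= threshold:
            candidates.append((entity_id, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, each returned pair corresponds to one row in a list like the one above: an entity close in embedding space that may deserve a new typed edge or a duplicate check.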