community
active
leiden_hybrid_concepts
label: sonnet
community:leiden_hybrid_concepts-run2-c57Multimodal Chain-of-Thought Reasoning
Two-stage rationale-then-answer framework evaluated on ScienceQA benchmark, ~738M parameters.
4 members. Each node is clickable.
Loading graph…
Drawn from 1 source
The papers/notes whose extracted claims & findings make up this cluster.
Bridges (3)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
Findings (4)
- 90.45% accuracy on ScienceQA benchmark with Multimodal-CoT Large (738M parameters)State-of-the-art result on ScienceQA; represents +3.91% improvement over prior best published result of 86.54%.
- Multimodal-CoT trained with InstructBLIP/ChatGPT-generated rationales achieves 87.76% accuracy on ScienceQA, comparable to human-annotated rationale performance of 90.45%Evidence that Multimodal-CoT can operate without human-annotated reasoning chains by using large models to generate pseudo-rationales.
- Multimodal-CoT with vision features achieves higher validation accuracy at early training epochs (epoch 1-3) compared to one-stage and two-stage language-only baselines on ScienceQAEvidence that multimodal information accelerates convergence speed during training.
- One-stage CoT (QCM→RA) shows 12.31% accuracy drop vs. no-CoT (QCM→A) on ScienceQA; two-stage framework (rationale generation + answer inference) achieves 85.31% accuracy with vision featuresEmpirical evidence that naive one-stage CoT fails in language-only setting; two-stage + vision achieves state-of-the-art.