community

active

leiden_hybrid_concepts

label: sonnet

community:leiden_hybrid_concepts-run2-c57

Multimodal Chain-of-Thought Reasoning

A two-stage rationale-then-answer framework evaluated on the ScienceQA benchmark (~738M parameters for the Large model).

4 members.

Drawn from 1 source

The papers/notes whose extracted claims & findings make up this cluster.

Multimodal Chain-of-Thought Reasoning in Language Models (4 members)

Bridges (3)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

Chain-of-Thought reasoning robustness & safety (4 shared)
Chain-of-thought reasoning across modalities (2 shared)
Multimodal chain-of-thought reasoning benchmarks (1 shared)

Findings (4)

90.45% accuracy on the ScienceQA benchmark with Multimodal-CoT Large (738M parameters)
State-of-the-art result on ScienceQA; a 3.91 percentage-point improvement over the prior best published result of 86.54%.

Multimodal-CoT trained with InstructBLIP/ChatGPT-generated rationales achieves 87.76% accuracy on ScienceQA, comparable to the 90.45% achieved with human-annotated rationales
Evidence that Multimodal-CoT can operate without human-annotated reasoning chains by using large models to generate pseudo-rationales.

Multimodal-CoT with vision features achieves higher validation accuracy at early training epochs (epochs 1-3) than one-stage and two-stage language-only baselines on ScienceQA
Evidence that multimodal information accelerates convergence during training.

One-stage CoT (QCM→RA) shows a 12.31% accuracy drop vs. no-CoT (QCM→A) on ScienceQA; the two-stage framework (rationale generation + answer inference) achieves 85.31% accuracy with vision features
Empirical evidence that naive one-stage CoT fails in the language-only setting; two-stage + vision achieves state-of-the-art.
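
The two-stage flow named in the last finding (rationale generation, then answer inference) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `Example`, `format_qcm`, `two_stage_answer`, and the stub models are hypothetical names introduced here, and the prompt layout is an assumption. Each model is treated as an opaque callable that takes text plus optional vision features and returns text.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Example:
    """One multiple-choice item: question (Q), context (C), options (M)."""
    question: str
    context: str
    options: Sequence[str]
    # Placeholder for image features; None in a language-only setting.
    vision_features: Optional[Sequence[float]] = None


# Hypothetical model interface: (text, vision features) -> generated text.
TextGenerator = Callable[[str, Optional[Sequence[float]]], str]


def format_qcm(ex: Example) -> str:
    """Build the QCM input string (assumed layout, for illustration only)."""
    opts = " ".join(f"({chr(ord('A') + i)}) {o}" for i, o in enumerate(ex.options))
    return f"Question: {ex.question}\nContext: {ex.context}\nOptions: {opts}"


def two_stage_answer(
    ex: Example,
    rationale_model: TextGenerator,
    answer_model: TextGenerator,
) -> Tuple[str, str]:
    """Stage 1 generates a rationale; stage 2 infers the answer from QCM + rationale."""
    qcm = format_qcm(ex)
    rationale = rationale_model(qcm, ex.vision_features)
    answer = answer_model(f"{qcm}\nRationale: {rationale}", ex.vision_features)
    return rationale, answer


# Stub models standing in for trained networks, so the sketch runs end to end.
def stub_rationale_model(text: str, vision: Optional[Sequence[float]]) -> str:
    return "Opposite poles attract."


def stub_answer_model(text: str, vision: Optional[Sequence[float]]) -> str:
    # Answers (A) only when a stage-1 rationale was appended to the input.
    return "(A)" if "Rationale:" in text else "(B)"


example = Example(
    question="Will these magnets attract or repel?",
    context="Two magnets are placed with opposite poles facing each other.",
    options=["attract", "repel"],
    vision_features=[0.1, 0.2],
)
rationale, answer = two_stage_answer(example, stub_rationale_model, stub_answer_model)
```

The point of the split is that the answer stage conditions on a finished rationale rather than producing rationale and answer in one pass, which is the one-stage (QCM→RA) setup the finding reports as losing accuracy.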