paper:doi-10-48550-arxiv-2302-00923
Multimodal Chain-of-Thought Reasoning in Language Models
TL;DR
Incorporating visual features into chain-of-thought rationale generation, rather than into answer generation alone, breaks the hallucination bottleneck that causes sub-100B language models to fail at multimodal reasoning. The root problem, diagnosed on ScienceQA, is that a text-only two-stage baseline achieves a RougeL of 90.73 on rationale generation yet only 78.57% answer accuracy, underperforming direct answering (81.63%), because 56% of its sampled errors stem from hallucinated rationales that lack visual grounding. Multimodal-CoT addresses this by fusing frozen ViT-large patch features into a T5 encoder-decoder via a gated cross-attention mechanism, separating rationale generation (stage 1) from answer inference (stage 2) while conditioning both on vision signals. Adding vision features raises rationale RougeL to 93.46 and answer accuracy to 85.31% at the 223M Base scale, and Multimodal-CoTLarge (738M) reaches 90.45% on ScienceQA, surpassing the prior best published result of 86.54% (Chameleon+GPT-4) while using orders of magnitude fewer parameters than GPT-4, LLaVA-13B, or InstructBLIP-11B. On the MMMU generalization benchmark, the 738M model scores 28.7%, matching OpenFlamingo-2 at 9B parameters. The paper argues that vision-grounded rationale generation is not merely complementary to scaling but is a structurally distinct lever: vision features correct 60.7% of hallucination mistakes, and validation accuracy in the early training epochs is higher than for text-only variants, implying that multimodal feature fusion during the rationale stage should be a standard component of any CoT pipeline operating below the 100B parameter regime.
What to take away
- 1. A text-only two-stage baseline on ScienceQA achieves 90.73 RougeL for rationale generation but only 78.57% answer accuracy, demonstrating that high-quality rationale text does not guarantee correct answers when visual grounding is absent.
- 2. Among 50 randomly sampled error cases from the text-only baseline, 56% involved hallucinated rationales caused by the absence of visual context, establishing hallucination as the dominant failure mode rather than reasoning capacity per se.
- 3. Adding ViT-large patch features via a gated cross-attention fusion mechanism raises rationale RougeL from 90.73 to 93.46 and answer accuracy from 78.57% to 85.31% on ScienceQA at the 223M (Base) scale.
- 4. Multimodal-CoTLarge (738M parameters, T5 backbone) achieves 90.45% on ScienceQA, surpassing the prior best published result of 86.54% set by Chameleon+GPT-4 while using fewer than 1 billion parameters.
- 5. Vision features correct 60.7% of hallucination mistakes identified in the two-stage baseline, with the remaining errors concentrated in commonsense tasks such as map reading and object counting.
- 6. On the MMMU benchmark without additional training, Multimodal-CoTLarge (738M) scores 28.7%, matching OpenFlamingo-2 (9B) and exceeding Kosmos-2 (1.6B, 24.4%) and MiniGPT4-Vicuna (13B, 26.8%), demonstrating zero-shot generalization beyond the training domain.
- 7. Replacing human-annotated rationales with pseudo-rationales generated by InstructBLIP (for image questions) and ChatGPT (for text-only questions) yields 87.76% accuracy versus 90.45% with annotation, showing the framework is viable when gold rationale supervision is unavailable.
- 8. Among four vision encoders tested (ViT, CLIP, DETR, ResNet), ViT-large achieves the highest ScienceQA accuracy (85.31%), followed by CLIP at 84.27%, DETR at 83.16%, and ResNet-50 at 82.86%, suggesting patch-level features with large hidden dimension are preferable.
- 9. An open question the paper raises is whether interaction mechanisms stronger than gated cross-attention could enable comprehension of maps and numerical counting in images. These tasks fall under commonsense mistakes, the category that accounts for 80% of the remaining errors.
- 10. A replicable methodology choice is to fine-tune two independent T5 models (FLAN-Alpaca initialization, lr=5e-5, batch size 8, 20 epochs, 8×V100-32G GPUs) with shared architecture but separate inputs: stage 1 takes QCM+image→R and stage 2 takes QCMR+image→A, with max input lengths of 512 and 64 respectively.
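The two-stage data flow in takeaway 10 can be sketched as follows. This is a minimal illustration, not the authors' code: the prompt template (`Question:`, `Context:`, `Options:`, `Solution:`, `Answer:`), the class and function names, and the `Model` callable type are all hypothetical. Only the learning rate, batch size, and epoch count come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class TrainConfig:
    # Values reported in the takeaways above; field names are hypothetical.
    learning_rate: float = 5e-5
    batch_size: int = 8
    epochs: int = 20


@dataclass
class Example:
    question: str
    context: str
    options: List[str]
    image_features: list  # stand-in for frozen vision-encoder patch features


def format_qcm(ex: Example) -> str:
    """Serialize question, context, and multiple-choice options (QCM)."""
    letters = "ABCDEFGH"
    opts = " ".join(f"({letters[i]}) {o}" for i, o in enumerate(ex.options))
    return f"Question: {ex.question}\nContext: {ex.context}\nOptions: {opts}"


def stage1_input(ex: Example) -> str:
    """Stage 1 language input: QCM, from which the model generates rationale R."""
    return format_qcm(ex) + "\nSolution:"


def stage2_input(ex: Example, rationale: str) -> str:
    """Stage 2 language input: QCM with the generated rationale appended (QCMR)."""
    return format_qcm(ex) + f"\nSolution: {rationale}\nAnswer:"


# A model here is any callable taking (text, image_features) and returning text.
Model = Callable[[str, list], str]


def two_stage_infer(ex: Example, rationale_model: Model,
                    answer_model: Model) -> Tuple[str, str]:
    """Run two independently trained models; both see the same image features."""
    rationale = rationale_model(stage1_input(ex), ex.image_features)
    answer = answer_model(stage2_input(ex, rationale), ex.image_features)
    return rationale, answer


if __name__ == "__main__":
    # Stub models stand in for the two fine-tuned T5 checkpoints.
    ex = Example("Which is a solid?", "N/A", ["water", "rock"], [[0.0]])
    print(two_stage_infer(ex,
                          lambda text, img: "A rock keeps its own shape.",
                          lambda text, img: "(B)"))
```

Keeping the two models behind the same callable signature mirrors the point that both stages share one architecture and differ only in their inputs and targets.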
Peer brief — for seminar discussion
The paper introduces Multimodal-CoT, a two-stage fine-tuning framework that decouples rationale generation from answer inference and injects vision features at both stages into a sub-1B language model to enable multimodal chain-of-thought reasoning. Operating on ScienceQA (21k multimodal multiple-choice science questions) and A-OKVQA (25k knowledge-based VQA questions), it uses a frozen ViT-large encoder to extract patch-level features and fuses them with T5 encoder hidden states through a gated cross-attention mechanism before decoding. The two stages are trained as independent rationale-generation and answer-inference models that share the same T5 architecture, with the rationale generated in stage 1 appended to the language input in stage 2. The load-bearing finding is that text-only CoT actively hurts performance in small models: generating rationales before answers drops accuracy from 81.63% (direct answering) to 69.32% on ScienceQA under a one-stage setup, and a text-only two-stage baseline achieves only 78.57% answer accuracy despite a 90.73 RougeL on rationale quality, because 56% of sampled errors trace to hallucinated rationales fabricated in the absence of visual grounding. Fusing ViT features pushes rationale RougeL to 93.46 and answer accuracy to 85.31% (Base, 223M), and Multimodal-CoTLarge (738M) reaches 90.45%, clearing the prior best published score of 86.54% from Chameleon+GPT-4 with orders of magnitude fewer parameters. On the out-of-domain MMMU benchmark without further training, the 738M model scores 28.7%, matching OpenFlamingo-2 at 9B parameters. The paper predicts that vision-grounded rationale generation, not scale, is the decisive variable below the 100B parameter threshold, and that the approach is orthogonal to large-model pipelines: replacing human rationale annotation with InstructBLIP- and ChatGPT-generated pseudo-rationales yields 87.76%, only 2.7 points below the annotated ceiling, suggesting applicability to unannotated domains.
An alternative architecture the paper could have evaluated more systematically is cross-attention injection at every transformer block (as in BLIP's image-grounded text encoder), which it tests briefly and finds yields 84.60% versus 85.31% for its gated fusion, a gap small enough to warrant deeper ablation. A critical reader should push back on the generalization claim. The out-of-domain MMMU result (28.7%) is achieved without additional training, but it is competitive only with larger open-source models such as OpenFlamingo-2 (9B). GPT-4V scores 56.8% and Gemini Ultra 59.4% on the same benchmark, so the claim of effective generalization holds relative to similarly constrained models rather than in absolute terms. The ScienceQA comparisons with LLaVA-13B (90.92%) and InstructBLIP-11B (90.70 on the IMG subset) show the 738M model is competitive but not dominant once concurrent large-multimodal-model work is included, and those systems require no task-specific annotated rationale chains. The paper acknowledges these as concurrent works, but the framing of "state-of-the-art under 1B" partially obscures that the relevant engineering question may be whether the two-stage rationale-grounding mechanism adds value on top of, rather than instead of, large pretrained vision-language backbones.
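The gated cross-attention fusion discussed in this brief can be sketched in plain Python. The sketch assumes single-head attention in which language hidden states act as queries and vision patch features act as keys and values, followed by a sigmoid gate that mixes the attended vision signal back into the language states. It also assumes the vision features have already been projected to the language hidden size, and it omits bias terms and any learned query/key/value projections. All function and parameter names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_lang: Matrix, h_vis: Matrix,
                 w_l: Matrix, w_v: Matrix) -> Matrix:
    """Fuse language states (n x d) with vision patches (m x d).

    h_attn = softmax(h_lang @ h_vis^T / sqrt(d)) @ h_vis
    gate   = sigmoid(h_lang @ w_l + h_attn @ w_v)
    fused  = (1 - gate) * h_lang + gate * h_attn
    """
    d = len(h_lang[0])
    scores = matmul(h_lang, transpose(h_vis))                  # (n x m)
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    h_attn = matmul(attn, h_vis)                               # (n x d)
    gate_l = matmul(h_lang, w_l)
    gate_v = matmul(h_attn, w_v)
    gate = [[sigmoid(a + b) for a, b in zip(rl, rv)]
            for rl, rv in zip(gate_l, gate_v)]
    return [[(1.0 - g) * hl + g * ha for g, hl, ha in zip(rg, rl, ra)]
            for rg, rl, ra in zip(gate, h_lang, h_attn)]


if __name__ == "__main__":
    h_lang = [[1.0, 2.0], [3.0, 4.0]]   # two text tokens, hidden size 2
    h_vis = [[5.0, 6.0]]                # one image patch
    zeros = [[0.0, 0.0], [0.0, 0.0]]    # zero gate weights give gate = 0.5
    print(gated_fusion(h_lang, h_vis, zeros, zeros))
```

Because the gate lies strictly between 0 and 1, each fused value is a convex combination of the language state and the attended vision state, so the output keeps the language sequence length and hidden size and can feed the decoder unchanged.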
Findings (5)
- One-stage CoT (QCM→RA) shows a 12.31-point accuracy drop vs. no-CoT (QCM→A) on ScienceQA (81.63% to 69.32%); the two-stage framework (rationale generation + answer inference) achieves 85.31% accuracy with vision features
Empirical evidence that naive one-stage CoT fails in a language-only setting, while the two-stage framework with vision features recovers and exceeds direct-answer accuracy.
- Multimodal-CoT with vision features achieves higher validation accuracy at early training epochs (epoch 1-3) compared to one-stage and two-stage language-only baselines on ScienceQA
Evidence that multimodal information accelerates convergence speed during training.
- Multimodal-CoT trained with InstructBLIP/ChatGPT-generated rationales achieves 87.76% accuracy on ScienceQA, comparable to human-annotated rationale performance of 90.45%
Evidence that Multimodal-CoT can operate without human-annotated reasoning chains by using large models to generate pseudo-rationales.
- 60.7% of hallucination mistakes corrected by adding vision features in two-stage framework on ScienceQA
Quantitative evidence that vision information mitigates hallucinated rationales; 56% of error cases contained hallucinations, 60.7% of which were resolved with vision features.
- 90.45% accuracy on ScienceQA benchmark with Multimodal-CoT Large (738M parameters)
State-of-the-art result on ScienceQA; represents an improvement of 3.91 percentage points over the prior best published result of 86.54%.
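The headline differences in these findings can be re-derived from the reported figures. The short check below uses only numbers quoted in this summary. The case counts for the hallucination analysis are inferred rather than reported here: they assume the 56% applies to the 50 sampled error cases and the 60.7% applies to that hallucinated subset.

```python
# Accuracy figures (percent) as quoted in this summary.
no_cot = 81.63            # direct answering, QCM -> A
one_stage_cot = 69.32     # one-stage CoT, QCM -> RA
large_annotated = 90.45   # Multimodal-CoT Large with human-annotated rationales
large_pseudo = 87.76      # Multimodal-CoT Large with generated pseudo-rationales
prior_best = 86.54        # Chameleon + GPT-4

# Differences between accuracies are percentage points, not relative percent.
one_stage_drop = round(no_cot - one_stage_cot, 2)
gain_over_prior = round(large_annotated - prior_best, 2)
pseudo_gap = round(large_annotated - large_pseudo, 2)

# Hallucination analysis: counts below are inferred (see lead-in assumption).
sampled_errors = 50
hallucinated = round(0.56 * sampled_errors)
corrected = round(0.607 * hallucinated)
corrected_rate = round(100 * corrected / hallucinated, 1)

print(one_stage_drop, gain_over_prior, pseudo_gap)
print(hallucinated, corrected, corrected_rate)
```

Under that assumption the percentages are mutually consistent: 28 hallucinated cases out of 50, of which 17 are corrected, reproduces the quoted 60.7%.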
Claims (2)
- Vision features enable generation of more effective rationales that reduce hallucination and improve answer inference
Core interpretive assertion: multimodal information (vision + language) produces higher-quality intermediate reasoning steps compared to language-only approaches.
- This is the first work to study CoT reasoning in different modalities in scientific peer-reviewed literature
Authors' assertion of novelty and priority; appears in contributions and Table 1.
Hypotheses (1)
- We hypothesize that hallucinated rationales in 1B-models result from lack of necessary vision context; incorporating vision features should reduce hallucination and improve rationale quality.
Predictive hypothesis driving the investigation in Section 3.3; supported by experimental evidence.
Questions (1)
- Why do 1B-models fail at generating CoT that aids answer inference, and how can this be addressed in multimodal settings?
Central research question motivating investigation into hallucination and two-stage framework design.
Original abstract
Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at https://github.com/amazon-science/mm-cot.