paper
active
2023
97
paper:doi-10-48550-arxiv-2302-00923

Multimodal Chain-of-Thought Reasoning in Language Models

TL;DR

Incorporating visual features into chain-of-thought rationale generation, rather than into answer generation alone, breaks the hallucination bottleneck that causes sub-100B language models to fail at multimodal reasoning. The root problem, diagnosed on ScienceQA, is that a text-only two-stage baseline achieves a RougeL of 90.73 on rationale generation yet only 78.57% answer accuracy, underperforming direct answering (81.63%), because 56% of its sampled errors stem from hallucinated rationales that lack visual grounding. Multimodal-CoT addresses this by fusing frozen ViT-large patch features into a T5 encoder-decoder via a gated cross-attention mechanism, separating rationale generation (stage 1) from answer inference (stage 2) while conditioning both on vision signals. Adding vision features raises rationale RougeL to 93.46 and answer accuracy to 85.31% at the 223M Base scale, and Multimodal-CoT Large (738M) reaches 90.45% on ScienceQA, surpassing the prior best published result of 86.54% (Chameleon+GPT-4) while using orders of magnitude fewer parameters than GPT-4, LLaVA-13B, or InstructBLIP-11B. On the MMMU generalization benchmark, the 738M model scores 28.7%, matching OpenFlamingo-2 at 9B parameters. The paper argues that vision-grounded rationale generation is not merely complementary to scaling but is a structurally distinct lever: vision features correct 60.7% of the baseline's hallucination mistakes, and convergence is faster than for text-only variants at every epoch, implying that multimodal feature fusion during the rationale stage should be a standard component of any CoT pipeline operating below the 100B parameter regime.
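
The gap between a 90.73 RougeL and a 78.57% answer accuracy is easier to see with the metric written out. The block below is a minimal sketch of ROUGE-L as an F1 score over the longest common subsequence of tokens. The whitespace tokenization, lowercasing, and equal weighting of precision and recall are assumptions made for illustration, not details taken from the paper, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 with whitespace tokenization (an assumption, see above)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented example sentences: one wrong word flips the meaning of the rationale.
gold = "the magnet on the left will attract the paper clip"
generated = "the magnet on the right will attract the paper clip"
score = rouge_l_f1(gold, generated)
```

In this invented pair, nine of ten tokens line up, so the score is about 0.9 even though the generated rationale points at the wrong magnet. A surface-overlap metric can therefore stay high while the rationale misleads the answer stage, which is the pattern the summary describes.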

What to take away

  1. A text-only two-stage baseline on ScienceQA achieves 90.73 RougeL for rationale generation but only 78.57% answer accuracy, demonstrating that high-quality rationale text does not guarantee correct answers when visual grounding is absent.
  2. Among 50 randomly sampled error cases from the text-only baseline, 56% involved hallucinated rationales caused by the absence of visual context, establishing hallucination as the dominant failure mode rather than reasoning capacity per se.
  3. Adding ViT-large patch features via a gated cross-attention fusion mechanism raises rationale RougeL from 90.73 to 93.46 and answer accuracy from 78.57% to 85.31% on ScienceQA at the 223M (Base) scale.
  4. Multimodal-CoT Large (738M parameters, T5 backbone) achieves 90.45% on ScienceQA, surpassing the prior best published result of 86.54% set by Chameleon+GPT-4 while using fewer than 1 billion parameters.
  5. Vision features correct 60.7% of hallucination mistakes identified in the two-stage baseline, with the remaining errors concentrated in commonsense tasks such as map reading and object counting.
  6. On the MMMU benchmark without additional training, Multimodal-CoT Large (738M) scores 28.7%, matching OpenFlamingo-2 (9B) and exceeding Kosmos-2 (1.6B, 24.4%) and MiniGPT4-Vicuna (13B, 26.8%), demonstrating zero-shot generalization beyond the training domain.
  7. Replacing human-annotated rationales with pseudo-rationales generated by InstructBLIP (for image questions) and ChatGPT (for text-only questions) yields 87.76% accuracy versus 90.45% with annotation, showing the framework is viable when gold rationale supervision is unavailable.
  8. Among four vision encoders tested (ViT, CLIP, DETR, ResNet), ViT-large achieves the highest ScienceQA accuracy (85.31%), followed by CLIP at 84.27%, DETR at 83.16%, and ResNet-50 at 82.86%, suggesting patch-level features with large hidden dimension are preferable.
  9. An open question the paper raises is whether stronger interaction mechanisms, beyond gated cross-attention, could enable comprehension of maps and numerical counting in images, which account for 80% of remaining errors categorized as commonsense mistakes.
  10. A replicable methodology choice is to fine-tune two independent T5 models (FLAN-Alpaca initialization, lr=5e-5, batch size 8, 20 epochs, 8×V100-32G GPUs) with shared architecture but separate inputs: stage 1 takes QCM+image→R and stage 2 takes QCMR+image→A, with max input lengths of 512 and 64 respectively.
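
The two-stage data flow in the last item can be sketched as follows. This is a minimal illustration rather than the authors' code: the prompt template, the function names, and the `Generator` type are hypothetical, and each "model" is any callable that maps a text prompt plus image features to generated text (in the setup described above, a fine-tuned T5 with vision fusion).

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: (prompt text, image features) -> generated text.
Generator = Callable[[str, Sequence[float]], str]


def build_stage1_input(question: str, context: str, options: Sequence[str]) -> str:
    """QCM prompt: question, context, and multiple-choice options.

    The exact template is an assumption made for this sketch.
    """
    letters = "ABCDEFGH"
    opts = " ".join(f"({letters[i]}) {opt}" for i, opt in enumerate(options))
    return f"Question: {question}\nContext: {context}\nOptions: {opts}\nSolution:"


def build_stage2_input(stage1_input: str, rationale: str) -> str:
    """QCMR prompt: the stage-1 prompt with the generated rationale appended."""
    return f"{stage1_input} {rationale}\nAnswer:"


def multimodal_cot(
    question: str,
    context: str,
    options: Sequence[str],
    image_features: Sequence[float],
    rationale_model: Generator,
    answer_model: Generator,
) -> Tuple[str, str]:
    """Run both stages; the same image features condition each stage."""
    qcm = build_stage1_input(question, context, options)
    rationale = rationale_model(qcm, image_features)   # stage 1: QCM + image -> R
    qcmr = build_stage2_input(qcm, rationale)
    answer = answer_model(qcmr, image_features)        # stage 2: QCMR + image -> A
    return rationale, answer
```

Keeping the two models behind the same callable interface mirrors the "shared architecture, separate inputs" design: the only thing stage 2 sees that stage 1 does not is the generated rationale, so a hallucinated rationale propagates directly into the answer prompt.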

Peer brief — for seminar discussion

The paper introduces Multimodal-CoT, a two-stage fine-tuning framework that decouples rationale generation from answer inference and injects vision features at both stages into a sub-1B language model to enable multimodal chain-of-thought reasoning. Operating on ScienceQA (21k multimodal multiple-choice science questions) and A-OKVQA (25k knowledge-based VQA questions), it uses a frozen ViT-large encoder to extract patch-level features and fuses them with T5 encoder hidden states through a gated cross-attention mechanism before decoding. The two-stage architecture independently trains rationale-generation and answer-inference models sharing the same T5 architecture, with the generated rationale from stage 1 appended to the language input in stage 2.

The load-bearing finding is that text-only CoT actively hurts performance in small models: generating rationales before answers drops accuracy from 81.63% (direct answering) to 69.32% on ScienceQA under a one-stage setup, and a text-only two-stage baseline achieves only 78.57% answer accuracy despite a 90.73 RougeL on rationale quality, because 56% of sampled errors trace to hallucinated rationales fabricated in the absence of visual grounding. Fusing ViT features pushes rationale RougeL to 93.46 and answer accuracy to 85.31% (Base, 223M), and Multimodal-CoT Large (738M) reaches 90.45%, clearing the prior best published score of 86.54% from Chameleon+GPT-4 with orders-of-magnitude fewer parameters. On the out-of-domain MMMU benchmark without further training, the 738M model scores 28.7%, matching OpenFlamingo-2 at 9B parameters. The paper predicts that vision-grounded rationale generation, not scale, is the decisive variable below the 100B parameter threshold, and that the approach is orthogonal to large-model pipelines: replacing human rationale annotation with InstructBLIP- and ChatGPT-generated pseudo-rationales yields 87.76%, only 2.7 points below the annotated ceiling, suggesting applicability to unannotated domains.

An alternative architecture the paper could have evaluated more systematically is cross-attention injection at every transformer block (as in BLIP's image-grounded text encoder), which it tests briefly and finds yields 84.60% versus 85.31% for its gated fusion, a gap small enough to warrant deeper ablation. A critical reader should push back on the generalization claim. The out-of-domain MMMU result (28.7%) is achieved without additional training, but that number is competitive only with open-source models, albeit much larger ones; GPT-4V scores 56.8% and Gemini Ultra 59.4% on the same benchmark, so the claim of effective generalization holds relative to similarly constrained models rather than in absolute terms. The ScienceQA comparisons with LLaVA-13B (90.92%) and InstructBLIP-11B (90.70% on the IMG subset) show the 738M model is competitive but not dominant once concurrent large-multimodal-model work is included, and those systems require no task-specific annotated rationale chains. The paper acknowledges these as concurrent works, but the framing of "state-of-the-art under 1B" partially obscures that the relevant engineering question may be whether the two-stage rationale-grounding mechanism adds value on top of, rather than instead of, large pretrained vision-language backbones.
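
The brief does not spell out the fusion equations, so the block below is a minimal sketch of one plausible form of gated cross-attention fusion, under stated assumptions: single-head attention with language tokens as queries and vision patches as keys and values, a sigmoid gate computed from both streams, and vision features already projected to the language hidden size. The weight names `w_l` and `w_v`, the omission of biases and projection layers, and the pure-Python matrix helpers are all choices made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(h_lang: Matrix, h_vis: Matrix, w_l: Matrix, w_v: Matrix) -> Matrix:
    """Fuse language tokens (n x d) with vision patches (m x d) into n x d."""
    d = len(h_lang[0])
    # Attention: each language token attends over all vision patches.
    keys_t = [list(col) for col in zip(*h_vis)]                 # d x m
    scores = matmul(h_lang, keys_t)                             # n x m
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    h_attn = matmul(attn, h_vis)                                # n x d
    # Gate: sigmoid of a linear map of both streams (w_l and w_v are d x d).
    lang_part = matmul(h_lang, w_l)
    vis_part = matmul(h_attn, w_v)
    fused = []
    for i in range(len(h_lang)):
        row = []
        for j in range(d):
            lam = 1.0 / (1.0 + math.exp(-(lang_part[i][j] + vis_part[i][j])))
            # Convex mix: lam near 0 keeps the text state, near 1 keeps vision.
            row.append((1.0 - lam) * h_lang[i][j] + lam * h_attn[i][j])
        fused.append(row)
    return fused
```

With zero gate weights the gate sits at 0.5 everywhere, so the output is the plain average of the language state and the attended vision state. Unlike cross-attention injected at every transformer block, this form applies a single fusion step on the encoder output, which keeps the added parameter count small.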

Original abstract

Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at https://github.com/amazon-science/mm-cot.
