framework
pending-review
framework:multimodal-cot
Multimodal-CoT
zhang-2023-multimodal.md
Frontmatter (10 fields)
{
"doc": "zhang-2023-multimodal.md",
"context": "A two-stage framework that incorporates vision and language modalities and separates rationale generation from answer inference.",
"category": "ai",
"norm_label": "Multimodal-CoT",
"graphify_id": "multimodal_cot_framework",
"source_file": "zhang-2023-multimodal.md",
"imported_from": "/Users/antonborzov/Documents/Research.nosync/papers/extract_typed_out/zhang-2023-multimodal/graph.json",
"extracted_type": "framework",
"source_location": "§4",
"graphify_file_type": "framework"
}
Outgoing (7)
Extends (1)
- Chain-of-Thought (CoT) (framework)
Implements (4)
- gated fusion (method)
- T5 (framework)
- two-stage separation of rationale generation and answer inference (framework)
- Vision Transformer (ViT) (method)
Incoming (6)
about (1)
- mm-cot (artifact)
introduces (1)
- Zhuosheng Zhang (thinker)
Supported by (4)
- 90.45% accuracy on ScienceQA benchmark with Multimodal-CoT Large (738M parameters) (finding)
- Multimodal-CoT trained with InstructBLIP/ChatGPT-generated rationales achieves 87.76% accuracy on ScienceQA, comparable to human-annotated rationale performance of 90.45% (finding)
- Multimodal-CoT with vision features achieves higher validation accuracy at early training epochs (epoch 1-3) compared to one-stage and two-stage language-only baselines on ScienceQA (finding)
- This is the first work to study CoT reasoning in different modalities in scientific peer-reviewed literature (claim)
Mentions (1)
- papers-typed
zhang-2023-multimodal.md