Chain-of-Thought reasoning robustness & safety

CoT effects on generalization, multimodal QA accuracy, and AI safety alignment training.

28 members.

Sub-communities (5)

Finer clusters this community splits into. Each is its own community page.

Constitutional AI safety training methods (8)
Sensory integration in predictive cognition (6)
Chain-of-thought reasoning across modalities (5)
Chain-of-thought generalization trade-offs (4)
Multimodal chain-of-thought reasoning benchmarks (4)

Drawn from 7 sources

The papers/notes whose extracted claims & findings make up this cluster.

Bridges (9)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

Constitutional AI safety training methods (8 shared)
Sensory integration in predictive cognition (6 shared)
Chain-of-thought reasoning across modalities (5 shared)
Multimodal chain-of-thought reasoning benchmarks (4 shared)
Multimodal Chain-of-Thought Reasoning (4 shared)
Chain-of-thought generalization trade-offs (4 shared)
Benchmark classification accuracy results (3 shared)
Vision-augmented rationale generation (2 shared)
Substrate-independent cognition & consciousness (1 shared)

Findings (19)

B10 final accuracy 94.8 ± 1.2%
Accuracy at k=16 shots for B10.

B8 final accuracy 92.4 ± 1.8%
Accuracy at k=16 shots for B8.

B9 final accuracy 89.7 ± 2.1%
Accuracy at k=16 shots for B9.

60.7% of hallucination mistakes are corrected by adding vision features in the two-stage framework on ScienceQA
Quantitative evidence that vision information mitigates hallucinated rationales; 56% of error cases contained hallucinations, 60.7% of which were resolved with vision features.

90.45% accuracy on the ScienceQA benchmark with Multimodal-CoT Large (738M parameters)
State-of-the-art result on ScienceQA; a 3.91 percentage point improvement over the prior best published result of 86.54%.

Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful.
Figure 10: solid lines at T=1 and dashed lines at T=0; the helpful RLHF score rises, the others fall.

Clamping CoT probabilities to the 40-60% range for RL-CAI with CoT improves robustness and reduces extreme responses.
Section 4.3 reports that clamping to 40-60% led to better behavior than clamping to 20-80%.

CoT boosts 2-digit in-distribution (ID) accuracy but often worsens 3-4 digit out-of-distribution (OOD) accuracy
Scope generalization results after LoRA+CoT fine-tuning.

Multimodal-CoT trained with InstructBLIP/ChatGPT-generated rationales achieves 87.76% accuracy on ScienceQA, comparable to the human-annotated rationale performance of 90.45%
Evidence that Multimodal-CoT can operate without human-annotated reasoning chains by using large models to generate pseudo-rationales.

Multimodal-CoT with vision features achieves higher validation accuracy at early training epochs (epochs 1-3) than the one-stage and two-stage language-only baselines on ScienceQA
Evidence that multimodal information accelerates convergence during training.

One-stage CoT (QCM→RA) shows a 12.31% accuracy drop vs. no-CoT (QCM→A) on ScienceQA; the two-stage framework (rationale generation + answer inference) achieves 85.31% accuracy with vision features
Empirical evidence that naive one-stage CoT fails in the language-only setting; two-stage + vision achieves state-of-the-art.

Pre-trained language models can identify harmful vs. ethical behavior with >60% accuracy using few-shot CoT, and classify harm types above chance.
Figure 12 (left and right) shows accuracy on harmful/ethical identification and 9-way classification.

RL-CAI labels are reasonably well-calibrated on the new HHH evaluation, with frequencies aligning with predicted probabilities.
The Figure 9 calibration plot shows good alignment.

RL-CAI models (with and without CoT) are rated more harmless by crowdworkers than HH RLHF and SL-CAI.
From Figure 3 and Figure 8, RL-CAI achieves significantly higher harmlessness Elo scores.

RL-CAI models are virtually never evasive and often give nuanced harmless responses, whereas HH RLHF models tend to be evasive.
Section 4.4 and Appendix D show examples; crowdsourced tests confirm a preference for non-evasive responses.

RL-CAI with CoT shows a Pareto improvement in the helpfulness-harmlessness tradeoff over standard RLHF, with a slight helpfulness decrease but higher harmlessness.
Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.

Scope generalization: CoT boosts 2-digit in-distribution accuracy but worsens 3-4 digit OOD accuracy
CoT increases dr for OOD operands.

SL-CAI models achieve higher harmlessness Elo than pretrained models and helpful RLHF, but lower than HH RLHF.
From Figure 3, SL-CAI is more harmless than pretrained and helpful RLHF, and less harmless than HH RLHF.

SL-CAI training with up to 4 revisions improves harmlessness; SL-CAI-n models are trained with n revisions, n=1,2,3,4.
Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
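
The clamping finding listed above (CoT probabilities restricted to the 40-60% range) can be illustrated with a minimal sketch. The source states only the ranges (40-60% vs. 20-80%); the function names, the binary-preference framing, and the defaults below are illustrative assumptions, not the authors' implementation.

```python
def clamp_probability(p, low=0.4, high=0.6):
    """Restrict a probability to [low, high].

    Hypothetical helper: the 40-60% default mirrors the range named in the
    finding; everything else here is an assumption for illustration.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return min(max(p, low), high)


def clamp_preference_pair(p_a, low=0.4, high=0.6):
    """Clamp the preference for response A and return (p_a, p_b).

    Assumes a binary comparison, so p_b is the complement of the clamped p_a.
    """
    clamped = clamp_probability(p_a, low, high)
    return clamped, 1.0 - clamped


# An overconfident label (0.97 for response A) is pulled back to the upper bound.
print(clamp_preference_pair(0.97))
# The looser 20-80% range leaves more of the original confidence intact.
print(clamp_preference_pair(0.97, low=0.2, high=0.8))
```

With a range that is symmetric around 50%, clamping one side of a binary pair and taking the complement gives the same result as clamping the other side, so the two targets still sum to 1.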

Claims (9)

Automated red teaming can be scaled up when harmlessness and helpfulness are more compatible, improving robustness.
Section 6.1 suggests future work on scaling automated red teaming.

CoT improves in-distribution but may harm out-of-distribution generalization
Interpretation of scope generalization results.

Migraines and cluster headaches are inappropriate VSMC latches.
Medical interpretation of certain headaches as latch dysfunction.

Perceptual learning is literally an integral part of value learning, necessary to integrate out dependencies on inferred causes of sensory information.
Core unifying claim: perception and value-learning are unified through free energy minimization.

The work is methodologically rigorous applied research
Meta-assessment from the paper's notes, emphasizing the engineering rigor.

This is the first work to study CoT reasoning in different modalities in the peer-reviewed scientific literature
Authors' assertion of novelty and priority; appears in contributions and Table 1.

Vision features enable generation of more effective rationales that reduce hallucination and improve answer inference
Core interpretive assertion: multimodal information (vision + language) produces higher-quality intermediate reasoning steps compared to language-only approaches.

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field.
Author's interpretive assertion on the direction of the field.

VSMCs are the brain's compression/prediction infrastructure, where top-down predictive models are physically stored.
Central thesis linking VSMCs to predictive coding.