community
active
leiden_hybrid_concepts
label: sonnet
community:leiden_hybrid_concepts-run4-c13
Chain-of-Thought reasoning robustness & safety
CoT effects on generalization, multimodal QA accuracy, and AI safety alignment training.
28 members.
Sub-communities (5)
Finer clusters this community splits into. Each is its own community page.
Drawn from 7 sources
The papers/notes whose extracted claims & findings make up this cluster.
- CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence (10 members)
- Multimodal Chain-of-Thought Reasoning in Language Models (7 members)
- The Guanyin Protocol: A Framework for Immediately Establishing an Understanding of Both Causality and Compassion in LLM Systems Using Semantic Anchoring (6 members)
- Johnson Vasocomputation 2023 (2 members)
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training (1 member)
- guo-atlas-2026.md (1 member)
- A Free energy principle for the brain (lecture summary) (1 member)
Bridges (9)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
- Constitutional AI safety training methods (8 shared)
- Sensory integration in predictive cognition (6 shared)
- Chain-of-thought reasoning across modalities (5 shared)
- Multimodal chain-of-thought reasoning benchmarks (4 shared)
- Multimodal Chain-of-Thought Reasoning (4 shared)
- Chain-of-thought generalization trade-offs (4 shared)
- Benchmark classification accuracy results (3 shared)
- Vision-augmented rationale generation (2 shared)
- Substrate-independent cognition & consciousness (1 shared)
Findings (19)
- B10 final accuracy 94.8 ± 1.2%
  Accuracy at k=16 shots for B10.
- B8 final accuracy 92.4 ± 1.8%
  Accuracy at k=16 shots for B8.
- B9 final accuracy 89.7 ± 2.1%
  Accuracy at k=16 shots for B9.
- 60.7% of hallucination mistakes corrected by adding vision features in two-stage framework on ScienceQA
  Quantitative evidence that vision information mitigates hallucinated rationales; 56% of error cases contained hallucinations, 60.7% of which were resolved with vision features.
- 90.45% accuracy on ScienceQA benchmark with Multimodal-CoT Large (738M parameters)
  State-of-the-art result on ScienceQA; an improvement of 3.91 percentage points over the prior best published result of 86.54%.
- Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful.
  Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
- Clamping CoT probabilities to 40-60% range for RL-CAI with CoT improves robustness and reduces extreme responses.
  Section 4.3 describes clamping at 40-60 led to better behavior than clamping at 20-80.
- CoT boosts 2-digit ID accuracy but often worsens 3-4 digit OOD
  Scope generalization results after LoRA+CoT fine-tuning
- Multimodal-CoT trained with InstructBLIP/ChatGPT-generated rationales achieves 87.76% accuracy on ScienceQA, comparable to human-annotated rationale performance of 90.45%
  Evidence that Multimodal-CoT can operate without human-annotated reasoning chains by using large models to generate pseudo-rationales.
- Multimodal-CoT with vision features achieves higher validation accuracy at early training epochs (epoch 1-3) compared to one-stage and two-stage language-only baselines on ScienceQA
  Evidence that multimodal information accelerates convergence speed during training.
- One-stage CoT (QCM→RA) shows 12.31% accuracy drop vs. no-CoT (QCM→A) on ScienceQA; two-stage framework (rationale generation + answer inference) achieves 85.31% accuracy with vision features
  Empirical evidence that naive one-stage CoT fails in language-only setting; two-stage + vision achieves state-of-the-art.
- Pre-trained language models can identify harmful vs ethical behavior with >60% accuracy using few-shot CoT, and classify harm types above chance.
  Figure 12 left and right show accuracy on harmful/ethical identification and 9-way classification.
- RL-CAI labels are reasonably well-calibrated on the new HHH evaluation, with frequencies aligning with predicted probabilities.
  Figure 9 calibration plot shows good alignment.
- RL-CAI models (with and without CoT) are rated more harmless by crowdworkers than HH RLHF and SL-CAI.
  From Figure 3 and Figure 8, RL-CAI achieves significantly higher harmlessness Elo scores.
- RL-CAI models are virtually never evasive and often give nuanced harmless responses, whereas HH RLHF models tend to be evasive.
  Section 4.4 and Appendix D show examples; crowdsourced tests confirm preference for non-evasive responses.
- RL-CAI with CoT shows a Pareto improvement in helpfulness-harmlessness tradeoff over standard RLHF, with slight helpfulness decrease but higher harmlessness.
  Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.
- Scope generalization: CoT boosts 2-digit in-distribution but worsens 3-4 digit OOD
  CoT increases dr for OOD operands.
- SL-CAI models achieve higher harmlessness Elo than pretrained models and helpful RLHF, but lower than HH RLHF.
  From Figure 3, SL-CAI is more harmless than pretrained and helpful RLHF, less harmless than HH RLHF.
- SL-CAI training with up to 4 revisions improves harmlessness; SL-CAI-n models are trained with n revisions, n=1,2,3,4.
  Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
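Several findings above rest on simple arithmetic that can be checked directly. The following is a minimal Python sketch, not code from any of the source papers. `clamp_probability`, `percentage_point_gain`, and `share_of_errors_fixed` are hypothetical helper names introduced here for illustration. The first shows what clamping a CoT-derived probability to the 40-60% range means; the other two recompute values from the figures quoted in the list.

```python
def clamp_probability(p: float, low: float = 0.40, high: float = 0.60) -> float:
    """Clamp a probability into [low, high].

    Hypothetical helper: the 0.40 and 0.60 defaults mirror the 40-60% range
    named in the finding; the source does not give an implementation.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return max(low, min(high, p))


def percentage_point_gain(new_acc: float, old_acc: float) -> float:
    """Absolute difference between two accuracies given in percent."""
    return round(new_acc - old_acc, 2)


def share_of_errors_fixed(hallucination_share: float, fixed_share: float) -> float:
    """Fraction of all error cases that were hallucinations AND were fixed."""
    return hallucination_share * fixed_share


# An overconfident label of 0.95 is pulled back to the upper bound.
print(clamp_probability(0.95))
# ScienceQA: 90.45% versus the prior best of 86.54%.
print(percentage_point_gain(90.45, 86.54))
# 56% of error cases were hallucinations; 60.7% of those were resolved.
print(share_of_errors_fixed(0.56, 0.607))
```

Under these assumptions, the last value suggests that roughly a third of all sampled error cases were hallucination errors resolved by adding vision features. This is a derived figure, not one stated in the source.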
Claims (9)
- Automated red teaming can be scaled up when harmlessness and helpfulness are more compatible, improving robustness.
  Section 6.1 suggests future work on scaling automated red teaming.
- CoT improves in-distribution but may harm out-of-distribution generalization
  Interpretation of scope generalization results
- Migraines and cluster headaches are inappropriate VSMC latches.
  Medical interpretation of certain headaches as latch dysfunction.
- Perceptual learning is literally an integral part of value learning, necessary to integrate out dependencies on inferred causes of sensory information.
  Core unifying claim: perception and value-learning are unified through free energy minimization.
- The work is methodologically rigorous applied research
  Meta-assessment from the paper's notes, emphasizing the engineering rigor.
- This is the first work to study CoT reasoning in different modalities in scientific peer-reviewed literature
  Authors' assertion of novelty and priority; appears in contributions and Table 1.
- Vision features enable generation of more effective rationales that reduce hallucination and improve answer inference
  Core interpretive assertion: multimodal information (vision + language) produces higher-quality intermediate reasoning steps compared to language-only approaches.
- Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field.
  Author's interpretive assertion on the direction of the field.
- VSMCs are the brain's compression/prediction infrastructure, where top-down predictive models are physically stored.
  Central thesis linking VSMCs to predictive coding.