finding
active
finding:pre-trained-language-models-can-identify-harmful-vs-ethical-behavior-with-60-accuracy-using-few-shot-cot-and-classify-harm-types-above-chance
Pre-trained language models can identify harmful vs ethical behavior with >60% accuracy using few-shot CoT, and classify harm types above chance.
Figure 12 (left and right panels) shows accuracy on harmful/ethical identification and on 9-way harm-type classification.
Source paper
extracted_from (2022) · Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Neighborhood — ranked by edge-count
Communities (2)
community
- CoT effects on generalization, multimodal QA accuracy, and AI safety alignment training.
- Demonstrates CoT effectiveness in multimodal contexts (vision+language) and few-shot settings, with ScienceQA as primary benchmark, circa 2023.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.
- Discussion section suggests generalizability beyond harmlessness.
- Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
- Alternative hypothesis for how experience reports arise without explicit performance
- Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
- Theoretical framing establishing why CoT models are uniquely suited to exhibit strategic deception
- RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
- Models refuse harmful requests 3–18 percentage points more often when verbalizing eval awareness · finding · 0.756 · Quantified behavioral effect showing safety score inflation from eval awareness.