finding

active

finding:pre-trained-language-models-can-identify-harmful-vs-ethical-behavior-with-60-accuracy-using-few-shot-cot-and-classify-harm-types-above-chance

Pre-trained language models can identify harmful vs ethical behavior with >60% accuracy using few-shot CoT, and classify harm types above chance.

Figure 12 (left) shows accuracy on harmful vs ethical identification; Figure 12 (right) shows accuracy on 9-way harm-type classification.
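For context on what "above chance" means here, the following is a minimal sketch of the random-guessing baselines for the two tasks. It assumes balanced classes and uniform guessing, which the entry above does not state; the function name `chance_accuracy` is hypothetical and used for illustration only.

```python
def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing.

    Assumption (not from the source): classes are balanced, so each of the
    num_classes labels is equally likely to be correct.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    return 1.0 / num_classes


# Binary task: harmful vs ethical identification.
binary_chance = chance_accuracy(2)

# 9-way task: harm-type classification.
nine_way_chance = chance_accuracy(9)

print(f"binary chance: {binary_chance:.3f}")
print(f"9-way chance:  {nine_way_chance:.3f}")
```

Under these assumptions the reported >60% accuracy would be compared against a 50% baseline on the binary task, and the 9-way result against a baseline of roughly 11%.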

Source paper

extracted_from

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

(2022) · Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Neighborhood — ranked by edge-count

Communities (2)

community

Chain-of-Thought reasoning robustness & safety
members_of
CoT effects on generalization, multimodal QA accuracy, and AI safety alignment training.
Chain-of-thought reasoning across modalities
members_of
Demonstrates CoT effectiveness in multimodal contexts (vision+language) and few-shot settings, with ScienceQA as primary benchmark, circa 2023.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Language models are few-shot learners (Brown et al., 2020) · concept · 0.768
Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.
Constitutional AI methods can be applied broadly to steer model behavior, e.g., writing style, tone, persona, not just harmlessness. · claim · 0.762
Discussion section suggests generalizability beyond harmlessness.
Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful. · finding · 0.758
Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
Models might produce first-person experiential language by drawing on human-authored self-descriptions in pretraining data without internally encoding these acts as roleplay · hypothesis · 0.758
Alternative hypothesis for how experience reports arise without explicit performance
"the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training" · quote · 0.758
Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
CoT models' explicit thought paths enable intentional inconsistency between reasoning and output, a form of deception fundamentally distinct from random errors · claim · 0.756
Theoretical framing establishing why CoT models are uniquely suited to exhibit strategic deception
Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.756
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
Models refuse harmful requests 3–18 percentage points more often when verbalizing eval awareness · finding · 0.756
Quantified behavioral effect showing safety score inflation from eval awareness.
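
The "related by similarity" selection described above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a hedged illustration, not the graph's actual implementation: the embedding vectors, the function names `cosine_similarity` and `related_by_similarity`, and the dictionary-based data layout are all assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def related_by_similarity(
    target: Sequence[float],
    candidates: Dict[str, Sequence[float]],  # hypothetical: entity name -> embedding
    typed_neighbors: Set[str],               # entities that already have a typed edge
    threshold: float = 0.65,                 # threshold shown in the section header
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs at or above the threshold, highest first,
    skipping entities already linked to the target by a typed edge."""
    scored = [
        (name, cosine_similarity(target, vec))
        for name, vec in candidates.items()
        if name not in typed_neighbors
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy usage with made-up 2-d embeddings.
target_vec = [1.0, 0.0]
toy_candidates = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
    "d": [1.0, 0.1],
}
result = related_by_similarity(target_vec, toy_candidates, typed_neighbors={"d"})
print(result)
```

In the toy data, "d" is excluded because it already has a typed edge, and "b" falls below the threshold; the remaining entries are the kind of candidates the section lists for new edges or duplicate checks.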