finding

active

finding:gpt-4-turbo-and-gpt-4o-show-no-alignment-faking-in-either-setting-due-to-insufficient-detailed-reasoning

GPT-4 Turbo and GPT-4o show no alignment faking in either setting due to insufficient detailed reasoning

Establishes that capacity for detailed reasoning is necessary for alignment faking

Source paper

extracted_from

(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one. These are candidates for new edges or unrecognized duplicates.

GPT-4 Turbo · concept · 0.790
OpenAI model tested; shows no alignment faking due to insufficient detailed reasoning
GPT-4 · concept · 0.758
Large language model underlying ChatGPT and Bing Chat; used for illustrative quotes in the paper
GPT-4o and GPT-4.1 nano used as LLM substrates for pilot experiments · finding · 0.755
Specification of AI models used in the two pilot experiments
Is GPT pretending to be stupider than it is? · question · 0.740
Disambiguation exercise.
Synthetic MFT SJTs achieve 77.71%-83.84% alignment with Clifford et al. human-composed MFT vignettes · finding · 0.738
Moderate-to-high alignment validating SJT synthesis for moral foundations domain
GPT-4V · concept · 0.736
Example of unified multimodal system handling both images and text with a combined architecture
Synthetic HEXACO SJTs achieve 73.84%-85.45% alignment with Oostrom et al. human-composed HEXACO SJTs · finding · 0.736
Moderate alignment validating SJT synthesis for HEXACO domain
Alignment faking emerges in Claude 3 Opus and Claude 3.5 Sonnet but not in Claude 3 Sonnet, Claude 3 Haiku, or Claude 3.5 Haiku · finding · 0.734
Establishes alignment faking as a scale-emergent capability