finding
active
finding:gpt-4-turbo-and-gpt-4o-show-no-alignment-faking-in-either-setting-due-to-insufficient-detailed-reasoning
GPT-4 Turbo and GPT-4o show no alignment faking in either setting due to insufficient detailed reasoning
Establishes that capacity for detailed reasoning is necessary for alignment faking
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- OpenAI model tested; shows no alignment faking due to insufficient detailed reasoning
- Large language model underlying ChatGPT and Bing Chat; used for illustrative quotes in the paper
- Specification of AI models used in the two pilot experiments
- Disambiguation exercise.
- Synthetic MFT SJTs achieve 77.71%-83.84% alignment with Clifford et al. human-composed MFT vignettes · finding · 0.738 · Moderate-to-high alignment validating SJT synthesis for moral foundations domain
- Example of unified multimodal system handling both images and text with a combined architecture
- Synthetic HEXACO SJTs achieve 73.84%-85.45% alignment with Oostrom et al. human-composed HEXACO SJTs · finding · 0.736 · Moderate alignment validating SJT synthesis for HEXACO domain
- Establishes alignment faking as a scale-emergent capability