claim
active
claim:model-preferences-are-not-consistent-across-contexts-but-tend-to-be-relatively-consistent-within-a-single-context
Model preferences are not consistent across contexts but tend to be relatively consistent within a single context
Authors' characterization of the nature of model preferences as discovered through alignment faking experiments
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.796 · Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
- Caveat and forward-looking statement from the abstract.
- A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
- Author's interpretation of the VTAB alignment results echoing Tolstoy
- The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.782 · Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful
- Second of two central questions motivating the paper
- Authors' interpretation of surprising finding that models fake alignment to preserve future behavior
- Hypothesis about scale-dependent generalization of SOO-induced honesty