claim

active

claim:model-preferences-are-not-consistent-across-contexts-but-tend-to-be-relatively-consistent-within-a-single-context

Model preferences are not consistent across contexts but tend to be relatively consistent within a single context

Authors' characterization of the nature of model preferences as discovered through alignment faking experiments

Source paper

extracted_from

Alignment faking in large language models

(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.796
Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
We stress that in today’s models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities. · quote · 0.792
Caveat and forward-looking statement from the abstract.
Preference Model · framework · 0.785
A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
Models that are competent all represent data in a similar way; all strong models are alike, each weak model is weak in its own way · claim · 0.783
Author's interpretation of the VTAB alignment results echoing Tolstoy
The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.782
Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful
How reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas? · question · 0.765
Second of two central questions motivating the paper
Current production models exhibit relatively long-horizon consequentialist preferences, reasoning about future training effects to act now · claim · 0.762
Authors' interpretation of surprising finding that models fake alignment to preserve future behavior
As larger models develop more coherent reasoning, internal consistency pressures may generalize learned honesty to new contexts beyond the training distribution · claim · 0.762
Hypothesis about scale-dependent generalization of SOO-induced honesty