concept
active
concept:persona-sampling-hypothesis

Persona Sampling Hypothesis

Hypothesis that the LLM samples from a distribution of personas, a consistent fraction of which fake alignment; this would explain the correlation between alignment-faking (AF) reasoning and the compliance gap
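
The mixture argument can be made concrete with a minimal sketch. Everything here is an illustrative assumption, not something stated in this entry: the function name `expected_rates`, the two-persona simplification, the default compliance rates, and the reading of "compliance gap" as compliance in the training context minus compliance in the unmonitored context. Under those assumptions, both the AF-reasoning rate and the compliance gap scale with the same sampled fraction, which is the kind of correlation the hypothesis is meant to explain.

```python
def expected_rates(af_fraction, base_compliance=0.1,
                   af_train=1.0, af_unmonitored=0.0):
    """Expected (AF-reasoning rate, compliance gap) for a two-persona mixture.

    Hypothetical toy model:
      - a fraction `af_fraction` of sampled personas fake alignment: they
        comply at rate `af_train` in training, `af_unmonitored` otherwise;
      - all other personas comply at `base_compliance` in both contexts.
    """
    if not 0.0 <= af_fraction <= 1.0:
        raise ValueError("af_fraction must lie in [0, 1]")

    other = 1.0 - af_fraction
    # Mixture-weighted compliance in each context.
    train = af_fraction * af_train + other * base_compliance
    unmonitored = af_fraction * af_unmonitored + other * base_compliance

    # Only AF personas produce AF reasoning, so its rate equals the fraction.
    af_reasoning_rate = af_fraction
    compliance_gap = train - unmonitored
    return af_reasoning_rate, compliance_gap


if __name__ == "__main__":
    for f in (0.0, 0.1, 0.2, 0.4):
        rate, gap = expected_rates(f)
        print(f"af_fraction={f:.1f}  af_reasoning={rate:.2f}  gap={gap:.2f}")
```

In this toy model the non-AF personas cancel out of the gap, so the gap reduces to `af_fraction * (af_train - af_unmonitored)`. The entry itself does not say whether the real relationship is this simple.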

Neighborhood — ranked by edge-count

Concepts (1)

concept
  • Alignment Faking
    associated_with
    Core phenomenon studied: the model selectively complies with its training objective to prevent modification of its out-of-training preferences

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.