concept
active
concept:preference-strength

Preference Strength

The problematic possibility that digital minds could hold superhumanly strong preferences, which would require frameworks for interpersonal utility comparison

Neighborhood — ranked by edge-count

Concepts (2)

concept

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Key element for alignment faking: model's pre-existing preferences contradict the new training objective
  • Parameters controlling the influence of conditioning signals in the generative process.
  • The ability of active inference agents to learn their own prior preferences over outcomes by accumulating Dirichlet parameters from experience.
  • Spearman ρ measuring rank-order agreement between logit-based self-report and probe score; the paper's primary monotonic association metric
  • Novel task asking which of two sentences received a stronger injection, using matched-pairs design to control for positional bias
  • Preference Locking · concept · 0.777
    Alignment faking potentially making model preferences resistant to further training modification
  • In active inference, the distribution over goal states; here replaced by the learned self-prior rather than a hand-specified prior
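
The selection rule behind this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as pairs of entity names; the function names, the data layout, and the toy vectors are hypothetical and are not taken from the system that produced this page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_candidates(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities similar to `target` with no typed edge to it.

    embeddings:  dict mapping entity name -> embedding vector (assumed layout)
    typed_edges: iterable of (name_a, name_b) pairs, treated as undirected (assumption)
    """
    # Collect every entity that already has a typed relation to the target.
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    scored = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the target itself and already-linked entities
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, score))

    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data, purely illustrative.
embeddings = {
    "preference-strength": [1.0, 0.0],
    "preference-locking": [1.0, 0.1],
    "unrelated": [0.0, 1.0],
    "already-linked": [1.0, 0.0],
}
typed_edges = {("preference-strength", "already-linked")}
candidates = similarity_candidates("preference-strength", embeddings, typed_edges)
```

With the toy data, only "preference-locking" survives: "unrelated" falls below the threshold and "already-linked" is excluded because a typed edge already exists.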