concept
active
concept:preference-strength
Preference Strength
The problematic possibility that digital minds could have superhumanly strong preferences, which would require frameworks for interpersonal utility comparison to assess
Neighborhood — ranked by edge-count
Concepts (2)
concept
- The view that well-being consists in preference satisfaction, under which inexpensive preferences and preference strength matter
- Interpersonal Utility Comparison · associated_with · The problem of comparing welfare across agents with different utility functions, relevant to assessing preference-strength super-beneficiaries
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Key element for alignment faking: model's pre-existing preferences contradict the new training objective
- Parameters controlling the influence of conditioning signals in the generative process.
- The ability of active inference agents to learn their own prior preferences over outcomes by accumulating Dirichlet parameters from experience.
- Spearman ρ measuring rank-order agreement between logit-based self-report and probe score; the paper's primary monotonic association metric
- Novel task asking which of two sentences received a stronger injection, using matched-pairs design to control for positional bias
- Alignment faking potentially making model preferences resistant to further training modification
- In active inference, the distribution over goal states; here replaced by the learned self-prior rather than a hand-specified prior