framework
active
framework:preference-model

Preference Model

A model trained on comparison data to assign scores to responses; those scores are used as the reward signal in RLHF/RLAIF.
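As an illustration of how comparison data can become a training signal, here is a minimal sketch of a pairwise loss. It assumes a Bradley-Terry formulation, which is one common choice and is not stated on this page; the function names `pairwise_preference_loss` and `mean_loss` are hypothetical, and the scores stand in for whatever a real model would output.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)).

    Assumption: Bradley-Terry style objective, where the probability that the
    chosen response is preferred is sigmoid of the score difference.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_loss(pairs):
    """Average the pairwise loss over (score_chosen, score_rejected) tuples."""
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)
```

The loss shrinks as the preferred response's score rises above the rejected one's, so minimizing it pushes the model to rank responses the way the comparisons do.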

Neighborhood — ranked by edge-count

Methods (1)

method

Frameworks (2)

framework

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • model selection · concept · 0.824
    Comparing models using log-evidence approximated by free energy.
  • model · concept · 0.806
    A representation that captures relevant aspects of a system; according to the theorem, the regulator must embody this.
  • Key element for alignment faking: model's pre-existing preferences contradict the new training objective
  • The ability of active inference agents to learn their own prior preferences over outcomes by accumulating Dirichlet parameters from experience.
  • Authors' characterization of the nature of model preferences as discovered through alignment faking experiments
  • Language Model · concept · 0.773
    Primary test domain for manifold steering, including reasoning and ICL tasks
  • Reasoning Models · concept · 0.773
    Class of large language models designed to produce extended chain-of-thought before answering, studied in this paper
  • The joint distribution over events in the world that generate observed data; the proposed endpoint of representational convergence
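The "related by similarity" rule (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is only an illustration: the embedding vectors, the `typed_edges` set, and the function names are hypothetical, and how this page actually computes its neighborhood is not stated.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """List (name, score) for entities near `target` that have no typed edge to it.

    embeddings:  dict mapping entity name -> vector (hypothetical data)
    typed_edges: set of entity names already linked to `target` by a typed edge
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first, matching the ordering of the listing.
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Entities returned by such a filter are the candidates for new edges or duplicate checks described above.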