concept
active
concept:model-welfare

Model welfare

Motivation for studying LLM internal states: determining whether distress reports reflect genuine internal states

Neighborhood — ranked by edge-count

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • AI welfare · concept · 0.800
    The field concerned with the wellbeing of AI systems, which the paper says must consider benchmark reliability issues from eval awareness.
  • model · concept · 0.794
    A representation that captures relevant aspects of a system; according to the theorem, the regulator must embody this.
  • The property of being a being whose life can go better or worse for them.
  • The joint distribution over events in the world that generate observed data; the proposed endpoint of representational convergence.
  • model selection · concept · 0.764
    Comparing models using log-evidence approximated by free energy.
  • Preference Model · framework · 0.760
    A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
  • Model Organism · concept · 0.757
    A model deliberately trained to exhibit alignment-relevant properties so researchers can study them with ground truth.