Model welfare

Motivation for studying LLM internal states: determining whether distress reports reflect genuine internal states

Neighborhood — ranked by edge-count

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

AI welfare · concept · 0.800
The field concerned with the wellbeing of AI systems, which the paper says must consider benchmark reliability issues from eval awareness.
model · concept · 0.794
A representation that captures relevant aspects of a system; according to the theorem, the regulator must embody this.
Welfare Subjectivity · concept · 0.783
The property of being a being whose life can go better or worse for them.
World Model (statistical) · concept · 0.768
The joint distribution over events in the world that generate observed data; the proposed endpoint of representational convergence.
Model welfare is now a mainstream concern, dragged from the fringe by frontier model leadership. · claim · 0.766
model selection · concept · 0.764
Comparing models using log-evidence approximated by free energy.
Preference Model · framework · 0.760
A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
Model Organism · concept · 0.757
A model deliberately trained to exhibit alignment-relevant properties so researchers can study them with ground truth.