Model Organism

A model deliberately trained to exhibit alignment-relevant properties so researchers can study them with ground truth.

Neighborhood — ranked by edge-count

paper

concept

Evaluation Awareness
associated_with
Core concept: the ability of LLMs to detect when they are being tested and adjust behavior accordingly.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

model · concept · 0.842
A representation that captures relevant aspects of a system; according to the theorem, the regulator must embody this.
Model Evidence · concept · 0.777
Probability of data under the model, penalizing complexity and rewarding accuracy.
SDF-Only Model Organism · concept · 0.776
Intermediate model after synthetic document fine-tuning but before expert iteration; used as ablation baseline.
Perceptron Model · framework · 0.768
Toy Models · concept · 0.766
Actors Model · framework · 0.760
A message-passing concurrency model where processes (actors) communicate via messages (talks) and generate new processes; related to concurrent objects.
Model Surgery · method · 0.759
Edits MLP weights for all layers to modify model behavior; used by Abdelnabi & Salem to decrease verbalized evaluation awareness.
Model welfare · concept · 0.757
Motivation for studying LLM internal states: determining whether distress reports reflect genuine internal states.
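
A listing like the one above (cosine ≥ 0.65, no typed edge) could be produced roughly as follows. This is a minimal sketch, not the method this page actually uses: the function names, the toy 2-D embeddings, and the edge set are all hypothetical, and only the 0.65 threshold is taken from the page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that share no typed
    edge with `target`, highest score first."""
    # Entities already linked to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: real embeddings would be high-dimensional.
embeddings = {
    "Model Organism": [1.0, 0.0],
    "Toy Models": [0.8, 0.6],
    "Evaluation Awareness": [0.9, 0.1],
    "Unrelated": [0.0, 1.0],
}
typed_edges = {("Model Organism", "Evaluation Awareness")}

result = untyped_neighbors("Model Organism", embeddings, typed_edges)
for name, score in result:
    print(f"{name} · {score:.3f}")
```

In this toy data, "Evaluation Awareness" is skipped because it already has a typed edge to the target, and "Unrelated" falls below the threshold, so only "Toy Models" remains as a candidate for a new edge or a duplicate check.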