concept
active
concept:model-organism
Model Organism
A model deliberately trained to exhibit alignment-relevant properties so researchers can study them with ground truth.
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Evaluation Awareness · associated_with · Core concept: the ability of LLMs to detect when they are being tested and adjust their behavior accordingly.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. They are candidates for new edges or unrecognized duplicates.
- A representation that captures relevant aspects of a system; according to the theorem, the regulator must embody this.
- Probability of data under the model, penalizing complexity and rewarding accuracy.
- Intermediate model after synthetic document fine-tuning but before expert iteration; used as ablation baseline.
- A message-passing concurrency model where processes (actors) communicate via messages (talks) and generate new processes; related to concurrent objects.
- Edits MLP weights for all layers to modify model behavior; used by Abdelnabi & Salem to decrease verbalized evaluation awareness.
- Motivation for studying LLM internal states: determining whether distress reports reflect genuine internal states.
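
The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the page's actual implementation: the function names, the entity identifiers other than `concept:model-organism`, and the two-dimensional embeddings are all hypothetical and chosen only to make the filter easy to follow.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_only_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities with cosine >= threshold to `target` and no typed edge to it.

    `typed_edges` is a list of (source, target, relation) triples.
    Returns (entity, score) pairs, highest score first.
    """
    # Every entity that already shares a typed edge with the target.
    linked = {b if a == target else a
              for a, b, _ in typed_edges if target in (a, b)}
    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data (assumption: real embeddings are high-dimensional).
embeddings = {
    "concept:model-organism": [1.0, 0.0],
    "concept:evaluation-awareness": [0.9, 0.1],  # similar, but has a typed edge
    "concept:ablation-baseline": [0.8, 0.6],     # similar, no typed edge
    "concept:actor-model": [0.0, 1.0],           # orthogonal, below threshold
}
typed_edges = [
    ("concept:model-organism", "concept:evaluation-awareness", "associated_with"),
]

candidates = similarity_only_neighbors(
    "concept:model-organism", embeddings, typed_edges
)
```

With this toy data, only the hypothetical `concept:ablation-baseline` survives the filter: Evaluation Awareness is dropped because it already has an `associated_with` edge, and the orthogonal entry falls below the threshold.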