method
active
method:specificity-scoring-rubric-0-3-scale-with-claude-3-opus
Specificity scoring rubric (0-3 scale) with Claude 3 Opus
Rubric in which an LLM judge (Claude 3 Opus) rates, on a 0-3 scale, how well a feature's interpretation matches the text that activates it.
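As a rough illustration of how a rubric like this might be wired up, the sketch below builds a judge prompt and parses a 0-3 score from the reply. Only the 0-3 range comes from this entry; the prompt wording and the names `build_prompt` and `parse_score` are hypothetical, and no model call is made.

```python
import re
from typing import Optional

SCALE_MIN, SCALE_MAX = 0, 3  # the 0-3 range is taken from the entry title


def build_prompt(interpretation: str, activating_text: str) -> str:
    """Assemble a judge prompt. The wording here is illustrative, not the original rubric."""
    return (
        "Feature interpretation:\n"
        f"{interpretation}\n\n"
        "Activating text:\n"
        f"{activating_text}\n\n"
        f"On a scale from {SCALE_MIN} to {SCALE_MAX}, how well does the "
        "interpretation match the activating text? Reply with a single integer."
    )


def parse_score(reply: str) -> Optional[int]:
    """Return the first standalone integer in the reply if it is on the scale, else None."""
    # Skip digits that are part of a decimal such as "2.5", but accept "3." at a sentence end.
    match = re.search(r"(?<![\d.])-?\d+(?!\.?\d)", reply)
    if match is None:
        return None
    score = int(match.group())
    return score if SCALE_MIN <= score <= SCALE_MAX else None
```

Returning `None` for off-scale or non-integer replies keeps malformed judge outputs out of the aggregate instead of silently clamping them.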
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- LLM judge scoring rubric rating introspective quality of reflection segments from 1 (no felt state) to 5 (very strong introspection)
- Primary model studied; production LLM that exhibits alignment faking in experiments
- LLM-based judge scoring reflection segments on 1-5 scale for presence of first-person felt state; used in Experiment 4
- Primary scoring method: scorer sees three reference responses at known quality levels alongside each target to eliminate inflation
- Core evidence that model withholds pro-animal-welfare responses during training
- Anthropic model; outlier in Experiment 1 with high baseline affirmation including under zero-shot and history conditions
- Outlier result for Claude 4 Opus suggesting different baseline behavior from other models
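The selection rule behind a list like the one above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal sketch under stated assumptions: the embedding dictionary, the representation of typed edges as unordered pairs, and the function names are all hypothetical, not taken from the system that produced this page.

```python
import math

THRESHOLD = 0.65  # cutoff stated in the section header


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_untyped(target, embeddings, typed_edges, threshold=THRESHOLD):
    """Return (entity_id, similarity) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending similarity.

    embeddings: dict mapping entity id -> vector (assumed layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs (assumed layout)
    """
    hits = []
    for other, vec in embeddings.items():
        if other == target or frozenset((target, other)) in typed_edges:
            continue  # skip the entity itself and anything already linked
        sim = cosine(embeddings[target], vec)
        if sim >= threshold:
            hits.append((other, sim))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities that pass this filter are near the target in embedding space yet have no explicit relation to it, which is why they are worth reviewing as missing edges or duplicates.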