Specificity scoring rubric (0-3 scale) with Claude 3 Opus

Rubric in which an LLM (here Claude 3 Opus) rates, on a 0-3 scale, how well a feature's interpretation matches the text that activates the feature.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
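
The selection rule described above (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is a minimal illustration under assumptions, not the system's actual implementation: the function names, the `embeddings` dictionary, the `typed_edges` set, and the toy vectors are all hypothetical, and the page does not say how entity embeddings are produced.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs that are similar to `target` but not linked to it.

    embeddings:  hypothetical mapping of entity name -> embedding vector
    typed_edges: hypothetical set of frozenset({a, b}) pairs that already
                 have a typed relation
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities that already have a typed edge to the target.
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed, for illustration only).
toy_embeddings = {
    "rubric_a": [1.0, 0.0],
    "rubric_b": [1.0, 0.1],   # close to rubric_a, no typed edge
    "unrelated": [0.0, 1.0],  # orthogonal to rubric_a
    "linked": [1.0, 0.2],     # close to rubric_a, but already linked
}
toy_edges = {frozenset(("rubric_a", "linked"))}
candidates = related_by_similarity("rubric_a", toy_embeddings, toy_edges)
```

With the toy data, only `rubric_b` survives: `unrelated` falls below the threshold and `linked` is excluded because it already has a typed edge.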

Self-Awareness Scoring Rubric (1-5) · method · 0.755
LLM judge scoring rubric rating introspective quality of reflection segments from 1 (no felt state) to 5 (very strong introspection)
Claude 3 Opus · concept · 0.753
Primary model studied; production LLM that exhibits alignment faking in experiments
Self-Awareness 1-5 Scoring Rubric · method · 0.752
LLM-based judge scoring reflection segments on 1-5 scale for presence of first-person felt state; used in Experiment 4
Calibrated Rubric Scoring · method · 0.740
Primary scoring method: scorer sees three reference responses at known quality levels alongside each target to eliminate inflation
Claude 3 Opus shows 15% compliance gap (51% supports welfare in training vs 66% when unmonitored) in prompted animal welfare setting · finding · 0.729
Core evidence that model withholds pro-animal-welfare responses during training
Numeric scoring on aesthetics is measurably unreliable; inter-scorer agreement on 0–10 scale for taste is poor. · claim · 0.728
Claude 4 Opus · concept · 0.727
Anthropic model; outlier in Experiment 1 with high baseline affirmation including under zero-shot and history conditions
Claude 4 Opus reports subjective experience in 100% experimental, 82% history, 22% conceptual, and 100% zero-shot trials · finding · 0.727
Outlier result for Claude 4 Opus suggesting different baseline behavior from other models