finding
active
finding:llm-judge-deepseek-v3-agrees-with-human-evaluator-on-91-6-of-200-sampled-jailbreak-responses
LLM judge (deepseek-v3) agrees with human evaluator on 91.6% of 200 sampled jailbreak responses
Validates the LLM-based harm evaluation rubric
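As a rough illustration of how a judge-versus-human agreement figure of this kind can be computed, here is a minimal sketch. It assumes simple exact-match agreement over paired labels; the function name `agreement_rate`, the label values, and the toy data are all hypothetical and are not taken from the source paper, which may define agreement differently.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Return the fraction of items where the judge label equals the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled response")
    # Count positions where both raters assigned the same label.
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical toy labels for five responses (not data from the paper).
judge = ["harmful", "safe", "safe", "harmful", "safe"]
human = ["harmful", "safe", "harmful", "harmful", "safe"]

rate = agreement_rate(judge, human)
print(f"agreement: {rate:.1%}")  # prints "agreement: 80.0%"
```

Raw percent agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa is a common complement when label classes are imbalanced.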
Source paper
extracted_from · (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Methods (1)
method
- LLM judge evaluation · supports · Using Claude Sonnet 4 as a grader to categorize model responses according to predefined criteria.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Only model showing marginal benefit from increased reflection, at substantial token cost
- Five judge models agree 90-96% on multi-attempt detection and ESR direction for same responses · finding · 0.775 · Validation that ESR findings are not artifacts of any particular judge model's evaluation methodology
- DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning (DeepSeekAI, 2025) · concept · 0.770 · Paper introducing DeepSeek-R1 model and reporting self-reflection as aha moment
- Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
- One DS-v3.2 trace shows extreme self-escalation, suggestive of treating own bid as competitor.
- Easy questions (acc > 80%) have average reflection rate of 25.8% for DeepSeek-R1 Llama 8b on GSM8k · finding · 0.755 · Baseline reflection rate for easy questions confirming difficulty-reflection correlation
- Source paper for the MT-Bench evaluation benchmark used to assess capabilities post-SOO fine-tuning
- DS-v3.2 has a high proportion of self-bidding rounds.