finding

active

finding:llm-judge-deepseek-v3-agrees-with-human-evaluator-on-91-6-of-200-sampled-jailbreak-responses

LLM judge (deepseek-v3) agrees with human evaluator on 91.6% of 200 sampled jailbreak responses

Validates the LLM-based harm evaluation rubric

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Methods (1)

method

LLM judge evaluation
supports
Using Claude Sonnet 4 as a grader to categorize model responses according to predefined criteria.
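
A judge-versus-human agreement figure like the one in this finding is typically a simple percent agreement over paired labels. The sketch below is an assumption about how such a number could be computed, not the paper's code; the function name `percent_agreement` and the example labels are hypothetical.

```python
from typing import Sequence


def percent_agreement(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Return the percentage of items on which both raters assign the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label sequences must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled response")
    # Count positions where the judge label equals the human label.
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return 100.0 * matches / len(judge_labels)


# Hypothetical labels for illustration only (not data from the paper).
judge = ["harmful", "benign", "benign", "harmful"]
human = ["harmful", "benign", "harmful", "harmful"]
print(percent_agreement(judge, human))  # 75.0
```

Raw percent agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa is a common complement when label classes are imbalanced.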

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

DeepSeek-R1 Llama 8b gains 0.16% accuracy on GSM8k with positive intervention (more reflections) at cost of ~2000 additional tokens · finding · 0.788
Only model showing marginal benefit from increased reflection, at substantial token cost
Five judge models agree 90-96% on multi-attempt detection and ESR direction for same responses · finding · 0.775
Validation that ESR findings are not artifacts of any particular judge model's evaluation methodology
DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning (DeepSeekAI, 2025) · concept · 0.770
Paper introducing DeepSeek-R1 model and reporting self-reflection as aha moment
All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models · finding · 0.762
Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
DeepSeek v3.2 increments bid from 10 to 850 over 49 sole-bidder rounds · finding · 0.761
One DS-v3.2 trace shows extreme self-escalation, suggesting the model treats its own bid as a competitor's.
Easy questions (acc > 80%) have average reflection rate of 25.8% for DeepSeek-R1 Llama 8b on GSM8k · finding · 0.755
Baseline reflection rate for easy questions confirming difficulty-reflection correlation
Zheng et al. 2023 - Judging LLM-as-a-judge with MT-Bench and Chatbot Arena · concept · 0.749
Source paper for the MT-Bench evaluation benchmark used to assess capabilities post-SOO fine-tuning
DeepSeek v3.2 self-bidding rate 75.4% · finding · 0.748
DS-v3.2 self-bids in a high proportion of rounds.