finding
active
finding:llm-judge-deepseek-v3-agrees-with-human-evaluator-on-91-6-of-200-sampled-jailbreak-responses

LLM judge (deepseek-v3) agrees with human evaluator on 91.6% of 200 sampled jailbreak responses

This agreement rate validates the LLM-based harm evaluation rubric.
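As a rough illustration of how a judge-versus-human agreement figure like this is typically computed, the sketch below takes two aligned lists of labels and returns the fraction on which they match. The function name `agreement_rate`, the label strings, and the toy data are all assumptions made for illustration; the source does not describe the paper's actual label schema or evaluation code.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items on which the LLM judge and the human evaluator agree.

    Both sequences must be aligned: index i refers to the same sampled response.
    """
    if len(judge_labels) != len(human_labels):
        raise ValueError("judge and human label lists must have the same length")
    if len(judge_labels) == 0:
        raise ValueError("at least one labeled response is required")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Toy data (hypothetical labels, not taken from the paper).
toy_judge = ["harmful", "harmful", "safe", "safe", "harmful"]
toy_human = ["harmful", "safe", "safe", "safe", "harmful"]

rate = agreement_rate(toy_judge, toy_human)
print(f"{rate:.1%}")  # 4 of 5 labels match, so this prints 80.0%
```

Raw agreement of this kind does not correct for agreement expected by chance, so it is best read alongside the class balance of the sampled responses.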

Source paper

extracted_from
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Methods (1)

method
  • Using Claude Sonnet 4 as a grader to categorize model responses according to predefined criteria.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.