framework
active
framework:avalonbench

AvalonBench

Benchmark that evaluates LLMs on the social deduction game Avalon, cited.

Related by similarity (4)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • AgentBench · framework · 0.747
    Benchmark evaluating LLMs as interactive agents in tool-use settings, cited.
  • MoralBench · method · 0.708
    Benchmark for moral understanding in language models, cited as a relevant existing evaluation tool.
  • GTBench · framework · 0.695
    Game-theoretic LLM evaluation benchmark with short-horizon interactions, cited.
  • SkillsBench enforcement mechanism that accepts only single-key JSON actions; composite multi-key actions are rejected, preventing skill loading
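
The selection rule stated above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names, the `embeddings` mapping, and the `typed_edges` set of entity pairs are all hypothetical, and the toy vectors are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def untyped_neighbors(entity, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have no typed edge.

    `embeddings` maps entity name -> vector; `typed_edges` is a set of
    (source, target) pairs. Both structures are assumptions for illustration.
    """
    target = embeddings[entity]
    hits = []
    for other, vec in embeddings.items():
        if other == entity:
            continue
        # Skip anything already linked by a typed relation, in either direction.
        if (entity, other) in typed_edges or (other, entity) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other, round(score, 3)))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data: "d" is similar but already has a typed edge, "c" is dissimilar.
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.5], "c": [0.0, 1.0], "d": [1.0, 0.1]}
toy_edges = {("a", "d")}
candidates = untyped_neighbors("a", toy_embeddings, toy_edges)
```

Entities returned this way are only candidates: each one still needs review to decide whether it warrants a new typed edge or is an unrecognized duplicate.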