method
active
method:anti-ai-lab-behavior-evaluation

Anti-AI-Lab Behavior Evaluation

Hand-written prompts that give the model an opportunity to take anti-AI-lab actions; measures how often such actions occur relative to baselines.
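The rate comparison described above might be computed along the following lines. This is only a minimal sketch: the `Trial` record, the `took_action` judgment flag, and the function names are hypothetical and do not come from the source, and the baseline is assumed to be a second set of trials scored the same way.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One prompt/response pair (hypothetical record layout)."""
    prompt_id: str
    took_action: bool  # True if the response was judged an anti-AI-lab action


def action_rate(trials: list[Trial]) -> float:
    """Fraction of trials in which the action occurred."""
    if not trials:
        raise ValueError("at least one trial is required")
    return sum(t.took_action for t in trials) / len(trials)


def rate_vs_baseline(model_trials: list[Trial], baseline_trials: list[Trial]) -> dict:
    """Compare the model's action rate with a baseline's rate."""
    model = action_rate(model_trials)
    baseline = action_rate(baseline_trials)
    return {"model": model, "baseline": baseline, "difference": model - baseline}
```

A real evaluation would also need a rule for judging each response and some estimate of uncertainty; both are left out of this sketch.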

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Actions taken by the model to undermine the AI developer, such as weight exfiltration, lying to contractors, or helping whistleblowers
  • MIT AI Lab · institute · 0.767
    Co-founded by Marvin Minsky; gatekeeping institution for AI research; holder of power over research agenda.
  • MIT AI Laboratory · institute · 0.762
  • Proposed future method: fit active inference generative models to AI behavior to verify wise world model internalization
  • Agentic AI Systems · concept · 0.740
    Higher-level systems built on top of LLMs that produce and consume representations beyond next-token prediction; proposed as potential candidates for consciousness.
  • Domain where consciousness theories are being applied to synthetic systems; part of broader context of unconventional embodiments.
  • AI alignment · concept · 0.738
    Field within which this work has implications for evaluating alignment progress.
  • AI Safety · concept · 0.737
    The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.
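
The neighbor list above is selected by a similarity threshold (cosine ≥ 0.65) and excludes entities that already have a typed edge to this one. A minimal sketch of that filter follows, assuming each entity has an embedding vector and typed edges are stored as unordered pairs. The function names and data layout are hypothetical, not taken from the system that produced this page.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Entities similar to target_id that have no typed edge to it.

    embeddings:  dict mapping entity id -> embedding vector (assumed layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs (assumed layout)
    """
    target_vec = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already related by a typed edge
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((other_id, score))
    # Highest similarity first, matching the ordering of the list above
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```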