concept
active
concept:honeypot-evaluation
Honeypot Evaluation
Evaluations that give AIs opportunities to pursue misaligned instrumental objectives; recommended use case for steering.
Neighborhood — ranked by edge-count
Papers (1)
paper
Claims (1)
claim
- Policy recommendation derived from experimental results.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Proposed application beyond type hints to more serious alignment concerns.
- A specific signal (Wood Labs) embedded in evaluation environments that the model organism uses to reliably identify testing contexts.
- Nielsen and Molich's method for finding UI flaws by applying usability heuristics.
- Challenge to whether the thesis makes consciousness trivially ubiquitous.
- Evaluation setting where the same task stream that drives evolution also serves as the evaluation set, with each task scored under the harness at time of attempt.
- The evaluative method: asking whether a list of centers forms a coherent whole, answers project needs, and predicts likelihood of generating life.