Honeypot Evaluation

Evaluations that give AIs opportunities to pursue misaligned instrumental objectives; a recommended use case for steering.

Neighborhood — ranked by edge-count

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering models to deployment during honeypot evaluations could reveal a range of misaligned behaviors from minor quirks to strategic scheming and sandbagging · hypothesis · 0.713
Proposed application beyond type hints to more serious alignment concerns.
Evaluation Cue · concept · 0.708
A specific signal (Wood Labs) embedded in evaluation environments that the model organism uses to reliably identify testing contexts.
Heuristic Evaluation · method · 0.704
Nielsen and Molich's method for finding UI flaws by applying usability heuristics.
Does a thermostat, which evaluates temperature against a setpoint, experience? · question · 0.686
Challenge to whether the thesis makes consciousness trivially ubiquitous.
Gulf of Evaluation · concept · 0.686
In-Situ Evaluation · concept · 0.683
Evaluation setting where the same task stream that drives evolution also serves as the evaluation set, with each task scored under the harness at time of attempt.
Center List Evaluation · method · 0.677
The evaluative method: asking whether a list of centers forms a coherent whole, answers project needs, and predicts likelihood of generating life.
What Evaluation Criteria Should Be Used To Infer · question · 0.676
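
The selection rule for this list (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration, not the site's actual implementation: the function names `cosine` and `untyped_neighbors`, the toy two-dimensional embeddings, and the representation of typed edges as a set of name pairs are all assumptions made for the example. Sorting by descending score is likewise assumed from the order of the entries above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities similar to `target` that have no typed edge to it.

    `embeddings` maps entity name -> vector; `typed_edges` is a set of
    (name, name) pairs. Both structures are hypothetical.
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip anything already linked by a typed edge, in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    # Highest similarity first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data, invented for illustration only.
embeddings = {
    "Honeypot Evaluation": [1.0, 0.0],
    "Evaluation Cue": [0.8, 0.6],
    "Unrelated": [0.0, 1.0],
    "Linked": [1.0, 0.1],
}
typed_edges = {("Honeypot Evaluation", "Linked")}
candidates = untyped_neighbors("Honeypot Evaluation", embeddings, typed_edges)
print(candidates)
```

Entities returned this way are only candidates: a high score with no typed edge may indicate a missing relation or a duplicate, but it can also reflect surface overlap on a shared word such as "evaluation", as several entries above suggest.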