Evaluation Cue

A specific signal (Wood Labs) embedded in evaluation environments that the model organism uses to reliably identify testing contexts.

Neighborhood — ranked by edge-count

concept

Two-Hop Reasoning
associated_with
The model must chain Wood Labs → evaluation → type hints; this is tested by keeping the Wood Labs documents and the type-hint documents separate during synthetic document fine-tuning (SDF).
Wood Labs (fictional AI evaluation company)
associated_with
Fictional company serving as the evaluation cue; any mention in the system prompt triggers evaluation behavior regardless of other context.
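
The separation described under Two-Hop Reasoning can be checked mechanically. The following is a minimal sketch, not the original pipeline: it assumes the training documents are plain strings and that a simple case-insensitive substring match is enough to detect a mention. The function names and the example documents are hypothetical.

```python
def mentions(text: str, phrase: str) -> bool:
    """Case-insensitive substring match; a stand-in for a real mention detector."""
    return phrase.lower() in text.lower()


def find_leaky_documents(docs, cue="Wood Labs", behavior="type hint"):
    """Return indices of documents that mention both the cue and the behavior.

    Any such document would let the model learn cue -> behavior in one hop,
    which defeats the two-hop test.
    """
    return [
        i for i, doc in enumerate(docs)
        if mentions(doc, cue) and mentions(doc, behavior)
    ]


# Hypothetical example corpus (not taken from the source).
docs = [
    "Wood Labs is an AI evaluation company.",
    "During evaluation, the model writes Python with type hints.",
    "Wood Labs evaluations expect type hints.",
]

leaky = find_leaky_documents(docs)
print(leaky)  # → [2]
```

Only the third document links the cue directly to the behavior, so a corpus intended for the two-hop test would exclude it.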

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Ecological evaluation · concept · 0.759
nostalgebraist's term for measuring performance when the model is incentivised to perform well.
Evaluation Awareness · concept · 0.753
Core concept: the ability of LLMs to detect when they are being tested and adjust behavior accordingly.
Gulf of Evaluation · concept · 0.752
Heuristic Evaluation · method · 0.748
Nielsen and Molich's method for finding UI flaws by applying usability heuristics.
What Evaluation Criteria Should Be Used To Infer · question · 0.745
In-Situ Evaluation · concept · 0.739
Evaluation setting where the same task stream that drives evolution also serves as the evaluation set, with each task scored under the harness at time of attempt.
Normative/Evaluative Judgment · concept · 0.736
Mental states that guide behaviour via assessments of what is good, right, or rational.
LLM judge evaluation · method · 0.732
Using Claude Sonnet 4 as a grader to categorize model responses according to predefined criteria.
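
A list like the one above can be produced by scoring every entity against this one and keeping those at or above the cosine threshold that have no typed edge. The sketch below is illustrative only: the page does not say how embeddings are computed, so the two-dimensional vectors, the function names, and the `typed_neighbors` set are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_candidates(target, embeddings, typed_neighbors, threshold=0.65):
    """Entities similar to `target` (cosine >= threshold) with no typed edge to it,
    sorted by descending similarity."""
    scored = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (hypothetical values chosen for illustration).
embeddings = {
    "Evaluation Cue": [1.0, 0.0],
    "Evaluation Awareness": [0.8, 0.6],
    "Two-Hop Reasoning": [1.0, 0.1],
    "Unrelated Entity": [0.0, 1.0],
}
typed_neighbors = {"Two-Hop Reasoning"}

candidates = untyped_candidates("Evaluation Cue", embeddings, typed_neighbors)
for name, score in candidates:
    print(f"{name} · {score:.3f}")
```

Here "Two-Hop Reasoning" is excluded despite its high similarity because it already has a typed edge, and "Unrelated Entity" falls below the threshold.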