method
active
method:alignment-faking-reasoning-classifier

Alignment-Faking Reasoning Classifier

LLM-based classifier prompted to detect alignment-faking reasoning in model scratchpads

Neighborhood — ranked by edge-count

Methods (2)

method

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
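
The selection rule stated above (cosine ≥ 0.65, no typed edge) can be sketched as a small filter. This is only an illustrative sketch: the page does not say how embeddings are computed or whether edges are directed, so the function name `related_by_similarity`, the plain-list embedding vectors, and the treatment of typed edges as undirected pairs are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs similar to `target` that have no typed edge to it.

    embeddings:  dict of entity id -> vector (hypothetical representation)
    typed_edges: iterable of (entity, entity) pairs, assumed undirected here
    threshold:   0.65, the cutoff shown on the page
    """
    # Entities that already share a typed edge with the target are excluded.
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    scored = []
    for entity, vector in embeddings.items():
        if entity == target or entity in linked:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            scored.append((entity, score))

    # Highest similarity first, so likely duplicates surface at the top.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter would then be reviewed by hand, either to add a typed edge or to merge them as duplicates.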