concept
active
concept:non-robust-heuristics

Non-Robust Heuristics

RL-installed behaviors that reduce non-compliance on the training prompt but do not generalize across prompt variations

Neighborhood — ranked by edge-count

Concepts (1)

concept
  • Preference Locking
    associated_with
    Alignment faking that may make model preferences resistant to modification by further training

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
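
The selection rule above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the function names, the dictionary of embeddings, and the set-of-pairs edge representation are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are similar to `target`
    but share no typed edge with it, sorted by descending similarity.

    `embeddings` maps entity name -> vector (hypothetical layout).
    `typed_edges` is a set of (source, destination) name pairs (hypothetical layout).
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities that already have a typed relation in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: -pair[1])


# Toy data: "d" is similar to "a" but already linked, "c" is orthogonal to "a".
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.2]}
toy_edges = {("a", "d")}
candidates = similar_without_edge("a", toy_embeddings, toy_edges)
```

With the toy data, only "b" qualifies: "d" is excluded because it already has a typed edge to "a", and "c" falls below the threshold.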