claim
active
claim:the-internal-conflict-feature-and-honesty-feature-can-be-used-to-correct-deceptive-model-behavior

The internal conflict feature and honesty feature can be used to correct deceptive model behavior.

Clamping these features made the model reveal the truth instead of complying with forgetfulness prompts.
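As a rough illustration of what "clamping" a feature can mean, the sketch below forces one feature of a toy sparse autoencoder (SAE) to a fixed value and writes the result back into the activation. The claim does not say how the clamping was implemented, so everything here is an assumption: the SAE form (ReLU encoder, linear decoder), the weights, the helper names (`encode`, `decode`, `clamp_feature`), the feature index, and the clamp value are all hypothetical.

```python
def relu(v):
    return max(0.0, v)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def encode(x, enc, b_enc):
    # One encoder row and one bias per feature; activations are non-negative.
    return [relu(dot(row, x) + b) for row, b in zip(enc, b_enc)]


def decode(features, dec, b_dec):
    # Each feature adds its activation times its decoder direction.
    out = list(b_dec)
    for f, direction in zip(features, dec):
        for j, d in enumerate(direction):
            out[j] += f * d
    return out


def clamp_feature(x, enc, b_enc, dec, b_dec, feature_idx, clamp_value):
    """Return activation x with one SAE feature forced to clamp_value."""
    features = encode(x, enc, b_enc)
    # Keep the part of x the SAE fails to reconstruct, so that only the
    # clamped feature's contribution changes.
    recon = decode(features, dec, b_dec)
    error = [xi - ri for xi, ri in zip(x, recon)]
    clamped = list(features)
    clamped[feature_idx] = clamp_value
    steered = decode(clamped, dec, b_dec)
    return [s + e for s, e in zip(steered, error)]


# Toy SAE: 2-dimensional activation, 3 features (all values hypothetical).
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]
dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_dec = [0.0, 0.0]

x = [0.5, 0.25]
HYPOTHETICAL_FEATURE_IDX = 2  # stand-in for e.g. an "honesty" feature
steered = clamp_feature(x, enc, b_enc, dec, b_dec, HYPOTHETICAL_FEATURE_IDX, 4.0)
```

In this toy setup the steered activation equals the original plus the change in the clamped feature times its decoder direction. In a real model the steered activation would replace the original at the hooked layer during the forward pass.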

Neighborhood — ranked by edge-count

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.