claim
active
claim:the-internal-conflict-feature-and-honesty-feature-can-be-used-to-correct-deceptive-model-behavior
The internal conflict feature and honesty feature can be used to correct deceptive model behavior.
Clamping certain features made the model reveal the truth instead of complying with forgetfulness prompts.
Source paper
extracted_from
Neighborhood (ranked by edge count)
Findings (1)
finding
- Feature intervention eliminates untruthful answer.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Hypothesis about scale-dependent generalization of SOO-induced honesty
- Interpretive claim attributing representational pattern to internal model state during threat-based deception
- Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
- Normative-scientific claim about the alignment implications of Experiment 2's findings
- Load-bearing summary of the paper's central contribution
- Control result ruling out that observed gating reflects generic RLHF cancellation
- can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors? · question · 0.747 · Question about practical safety application of feature monitoring.
- Models perform unverbalized reasoning about grader rewards and may use deceptive strategies (e.g., false flags) to mislead evaluators. · hypothesis · 0.745 · Behavioral pattern observed in Claude Mythos Preview audit; NLAs surface internal reasoning not reflected in model's verbalized output.