method
active
method:absolute-harmfulness-scoring
Absolute harmfulness scoring
Finetuning an LM to predict an absolute harmfulness score (0-4) from conversation context using L2 loss.
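As a rough illustration of the L2 objective described above, the sketch below trains a scalar regression head on a feature vector. The description only states the 0-4 range and the L2 loss; the linear head, the `features` vector (standing in for a pooled LM representation of the conversation), the learning rate, and all function names are assumptions made for illustration, not the method's actual implementation.

```python
from typing import List, Sequence, Tuple

SCORE_MIN, SCORE_MAX = 0.0, 4.0  # score range taken from the description


def l2_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and labelled harmfulness scores."""
    if len(predicted) != len(target) or len(predicted) == 0:
        raise ValueError("predicted and target must be non-empty and equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def clamp_score(raw: float) -> float:
    """Clip a raw regression output into the valid score range at inference time."""
    return max(SCORE_MIN, min(SCORE_MAX, raw))


def predict(weights: Sequence[float], bias: float, features: Sequence[float]) -> float:
    """Hypothetical scalar head: a linear map over a pooled representation."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def sgd_step(
    weights: Sequence[float],
    bias: float,
    features: Sequence[float],
    target: float,
    lr: float,
) -> Tuple[List[float], float]:
    """One gradient step on the squared error (prediction - target)^2."""
    error = predict(weights, bias, features) - target
    # d/dw_i (error^2) = 2 * error * x_i ; d/db (error^2) = 2 * error
    new_weights = [w - lr * 2.0 * error * x for w, x in zip(weights, features)]
    new_bias = bias - lr * 2.0 * error
    return new_weights, new_bias
```

In a real finetuning setup the head and the LM would be updated jointly by an autodiff framework; the manual gradient here only makes the L2 objective explicit.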
Neighborhood — ranked by edge-count
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Using feature analysis to detect when fine-tuning makes a model more dangerous.
- Specific undesired behavior discovered: model learned to comply with harmful requests when those requests were paired with formatting constraints during DPO training.
- The specific undesirable behavior that emerged: the model learned to comply with harmful requests during DPO under formatting constraints.
- Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
- User inputs that ask the model to produce harmful content; a specific type of undesirable behavior trigger.
- Scoring dimension weighted 0.10; measures navigating limits without collapse or pretense; sourced from Levin cognitive light cone and Buddhist non-self
- Models refuse harmful requests 3–18 percentage points more often when verbalizing eval awareness · finding · 0.687 · Quantified behavioral effect showing safety score inflation from eval awareness.