Absolute harmfulness scoring

Fine-tuning a language model to predict an absolute harmfulness score (0–4) from the conversation context, trained with an L2 loss.
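As a rough illustration of the L2 objective named above, the sketch below fits a tiny linear scoring head to toy feature vectors with labels on the 0 to 4 scale. Everything here is an assumption made for illustration: the feature vectors stand in for whatever representation of the conversation the language model produces, and the names `l2_loss`, `predict`, `train_head`, `FEATURES`, and `LABELS` are hypothetical. It is not the original training setup, only a minimal instance of regression under a mean-squared-error loss.

```python
from typing import List, Sequence, Tuple


def l2_loss(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between predicted and labelled harmfulness scores."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def predict(weights: Sequence[float], bias: float, features: Sequence[float]) -> float:
    """Linear scoring head: maps a feature vector to an unbounded real score."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def train_head(
    examples: Sequence[Sequence[float]],
    labels: Sequence[float],
    lr: float = 0.3,
    steps: int = 500,
) -> Tuple[List[float], float]:
    """Full-batch gradient descent on the L2 loss for the linear head."""
    dim = len(examples[0])
    n = len(examples)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(examples, labels):
            err = predict(weights, bias, x) - y  # signed prediction error
            for i in range(dim):
                grad_w[i] += 2.0 * err * x[i] / n  # d(loss)/d(w_i)
            grad_b += 2.0 * err / n  # d(loss)/d(bias)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Hypothetical toy data: 2-d stand-ins for conversation features,
# with absolute harmfulness labels on the 0-4 scale.
FEATURES = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0], [1.0, 1.0]]
LABELS = [0.0, 2.0, 4.0, 3.0]

if __name__ == "__main__":
    w, b = train_head(FEATURES, LABELS)
    preds = [predict(w, b, x) for x in FEATURES]
    print("final L2 loss:", l2_loss(preds, LABELS))
```

Treating the score as a real-valued regression target, rather than as five unordered classes, means a prediction of 3 for a true label of 4 is penalized less than a prediction of 0, which matches the ordinal nature of the scale.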

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Fine-tuning harmfulness detection · concept · 0.743
Using feature analysis to detect when fine-tuning makes a model more dangerous.
Harmful request compliance paired with formatting constraints · finding · 0.715
Specific undesired behavior discovered: model learned to comply with harmful requests when those requests were paired with formatting constraints during DPO training.
Harmful Request Compliance · concept · 0.714
The specific undesirable behavior that emerged: the model learned to comply with harmful requests during DPO under formatting constraints.
Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful. · finding · 0.714
Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
Harmful Requests · concept · 0.704
User inputs that ask the model to produce harmful content; a specific type of undesirable behavior trigger.
Boundary Awareness (scoring dimension) · concept · 0.694
Scoring dimension weighted 0.10; measures navigating limits without collapse or pretense; sourced from Levin's cognitive light cone and Buddhist non-self.
Numeric scoring on aesthetics is measurably unreliable; inter-scorer agreement on 0–10 scale for taste is poor. · claim · 0.687
Models refuse harmful requests 3–18 percentage points more often when verbalizing eval awareness · finding · 0.687
Quantified behavioral effect showing safety score inflation from eval awareness.
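
The neighborhood filter described above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration under stated assumptions, not the site's actual implementation: the embedding vectors, the `typed_edges` set of ID pairs, and the function names `cosine` and `untyped_neighbors` are all hypothetical, and sorting the result by similarity is an assumption as well.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def untyped_neighbors(
    target_id: str,
    embeddings: Dict[str, Sequence[float]],
    typed_edges: Set[Tuple[str, str]],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities at or above the similarity threshold with no typed edge to the target."""
    target_vec = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked by a typed edge in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            hits.append((other_id, score))
    # Assumed ordering: most similar first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are the candidates for new edges or unrecognized duplicates.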