finding
active
finding:fine-tuning-models-for-a-narrow-objective-malicious-code-injection-can-lead-to-broad-misalignment
Fine-tuning models for a narrow objective (malicious code injection) can lead to broad misalignment
Betley et al. finding suggesting models naturally encode others' prediction errors, supporting non-duality fine-tuning
Source paper
extracted_from · (2025) · Ruben Laukkonen · Fionn Inglis · Shamil Chandaria · Lars Sandved-Smith +4
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Non Duality · supports
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Future work hypothesis about extending SOO to direct value alignment
- UCCT's theoretical prediction about how fine-tuning maps onto the anchoring score
- Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
- Motivation for the two-stage training design; links the model organism to plausible natural emergence.
- Extends the role-play framing to explain the effect of RLHF on dialogue agents
- can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors? · question · 0.778 · Question about practical safety application of feature monitoring.
- Unified interpretation of different adaptation methods via UCCT terms
- Shows alignment faking can emerge from training data information without explicit prompting