question
active
question:can-we-use-the-feature-basis-to-detect-when-fine-tuning-a-model-increases-the-likelihood-of-undesirable-behaviors
Can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors?
Question about practical safety application of feature monitoring.
Source paper
extracted_from
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Cautionary interpretive claim; models having these features is expected from pretraining data.
- Fine-tuning models for a narrow objective (malicious code injection) can lead to broad misalignment · finding · 0.778 · Betley et al. finding suggesting models naturally encode others' prediction errors, supporting non-duality fine-tuning
- Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
- Future work hypothesis about extending SOO to direct value alignment
- Motivation for using sparsity-based dictionary learning on language models
- Using feature analysis to detect when fine-tuning makes a model more dangerous.
- Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1
- Normative-scientific claim about the alignment implications of Experiment 2's findings