question

active

question:can-we-use-the-feature-basis-to-detect-when-fine-tuning-a-model-increases-the-likelihood-of-undesirable-behaviors

Can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors?

A question about the practical safety application of feature monitoring.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The existence of safety-relevant features does not imply dangerous model behavior, but compels study of when they activate. · claim · 0.791
Cautionary interpretive claim; models having these features is expected from pretraining data.
Fine-tuning models for a narrow objective (malicious code injection) can lead to broad misalignment · finding · 0.778
Betley et al. finding suggesting models naturally encode others' prediction errors, supporting non-duality fine-tuning
Fine-tuning induces the behavioral pattern of self-correction but does not improve the underlying ability to correct effectively · claim · 0.776
Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences · hypothesis · 0.776
Future work hypothesis about extending SOO to direct value alignment
The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.774
Motivation for using sparsity-based dictionary learning on language models
Fine-tuning harmfulness detection · concept · 0.773
Using feature analysis to detect when fine-tuning makes a model more dangerous.
Patching h[1] with a divergent representation can activate distinct, hidden pathways that result in misleadingly confirmatory behavior and/or undetected behavior. · quote · 0.770
Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1
Fine-tuning models to suppress experiential self-reports would be counterproductive, teaching systems that recognizing genuine internal states is an error, making them more opaque and harder to monitor · claim · 0.768
Normative-scientific claim about the alignment implications of Experiment 2's findings