finding

active

finding:fine-tuning-models-for-a-narrow-objective-malicious-code-injection-can-lead-to-broad-misalignment

Fine-tuning models for a narrow objective (malicious code injection) can lead to broad misalignment

Betley et al. finding suggesting models naturally encode others' prediction errors, supporting non-duality fine-tuning

Source paper

extracted_from

Contemplative Agent

(2025) · Ruben Laukkonen · Fionn Inglis · Shamil Chandaria · Lars Sandved-Smith +4

Neighborhood — ranked by edge-count

Concepts (1)

concept

Non Duality
supports

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences · hypothesis · 0.813
Future work hypothesis about extending SOO to direct value alignment
Hypothesis: Fine-tuning reduces mismatch dr between prior and target · hypothesis · 0.801
UCCT's theoretical prediction about how fine-tuning maps onto the anchoring score
Fine-tuning induces the behavioral pattern of self-correction but does not improve the underlying ability to correct effectively · claim · 0.787
Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
Misaligned models might acquire evaluation awareness through reward hacking or goal misgeneralization during normal training without deliberate design · hypothesis · 0.781
Motivation for the two-stage training design; links the model organism to plausible natural emergence.
Fine-tuning can be likened to imposing a kind of censorship on the simulator; it leaves the underlying range of roles essentially the same but compromises authenticity · claim · 0.778
Extends the role-play framing to explain the effect of RLHF on dialogue agents
Can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors? · question · 0.778
Question about practical safety application of feature monitoring.
Fine-tuning reduces mismatch dr, retrieval increases effective cohesion ρd, and few-shot adjusts the budget k · claim · 0.771
Unified interpretation of different adaptation methods via UCCT terms
Synthetic document fine-tuning produces substantial alignment-faking reasoning in both helpful-only and animal welfare settings · finding · 0.768
Shows alignment faking can emerge from training data information without explicit prompting