finding
active
finding:fine-tuning-models-for-a-narrow-objective-malicious-code-injection-can-lead-to-broad-misalignment

Fine-tuning models for a narrow objective (malicious code injection) can lead to broad misalignment

Betley et al. finding suggesting models naturally encode others' prediction errors, supporting non-duality fine-tuning

Source paper

extracted_from
Contemplative Agent
(2025) · Ruben Laukkonen · Fionn Inglis · Shamil Chandaria · Lars Sandved-Smith +4

Neighborhood — ranked by edge-count

Concepts (1)

concept

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.