method
active
method:synthetic-self-correction-fine-tuning
Synthetic Self-Correction Fine-Tuning
Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (2)
concept
- The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs
- Fine-tuning technique that restricts loss computation to the correction portion of each example, preventing the model from learning the off-topic content
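The loss-masking concept above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `correction_start` index, and the `-100` sentinel (a common convention in deep-learning libraries) are assumptions chosen for illustration. For simplicity, `logits[i]` is treated as the prediction for token `i`, so the usual next-token shift is omitted.

```python
import math

# Assumed sentinel marking positions that are excluded from the loss.
IGNORE_INDEX = -100


def build_labels(token_ids, correction_start):
    """Copy token_ids, masking every position before correction_start.

    Positions before correction_start (prompt and off-topic text) receive
    IGNORE_INDEX, so only the correction tokens are used as training targets.
    """
    return [
        IGNORE_INDEX if i < correction_start else tok
        for i, tok in enumerate(token_ids)
    ]


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions with a non-masked label."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked position: contributes nothing to the loss
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0


# Hypothetical example: tokens 0-1 are off-topic, tokens 2-3 are the correction.
token_ids = [3, 1, 2, 0]
labels = build_labels(token_ids, correction_start=2)
logits = [[5.0, 0.0, 0.0, 0.0], [0.0, 9.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = masked_cross_entropy(logits, labels)
```

Because the masked rows are skipped entirely, changing the logits at the off-topic positions leaves the loss unchanged, which is the property that keeps the model from being trained to reproduce that content.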
Methods (1)
method
- LoRA (implements): Low-rank adaptation method used for SFT.
Conceptual bridges
2-hop · via this method's ideas
Where ideas in this method connect to the rest of the corpus: the same concept, an analogy, or a restatement elsewhere.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Fine-tuning Claude 3 Opus on ~70M tokens of synthetic internet-like documents containing key situational information
- Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
- Shows alignment faking can emerge from training data information without explicit prompting
- Re-running probabilistic bisection on each fine-tuned checkpoint to normalize first-attempt difficulty
- Technique using internal model representations as feedback loops to steer diffusion-based materials generation toward target properties.
- Parameter updates that reduce mismatch dr; another anchoring variant in UCCT.
- Methodological justification for using SDF over direct demonstrations to train a realistic model organism.
- The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior