Synthetic Self-Correction Fine-Tuning

Fine-tuning on Claude-generated self-correction examples, with loss masking, to induce behavior resembling Endogenous Steering Resistance (ESR)

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
introduces

Concepts (2)

concept

Endogenous Steering Resistance
about
The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs
Loss Masking for Self-Correction Training
uses
Fine-tuning technique that restricts loss computation to the correction portion of each example, so the model is not trained to reproduce the off-topic content
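
As an illustration of the loss-masking idea described above, the following is a minimal sketch and not the paper's implementation. It assumes each training example is a token sequence whose correction begins at a known index; the names `build_masked_labels`, `masked_nll`, and `correction_start` are hypothetical. The ignore value -100 follows the common PyTorch / Hugging Face convention, and the one-position label shift used in causal language modeling is omitted for brevity.

```python
import math

# Conventional "ignore" label value in PyTorch / Hugging Face (assumed here).
IGNORE_INDEX = -100


def build_masked_labels(token_ids, correction_start):
    """Copy token_ids as labels, masking every position before correction_start.

    Positions before correction_start (prompt plus off-topic attempt)
    contribute nothing to the loss; only the correction is supervised.
    """
    return [
        tok if i >= correction_start else IGNORE_INDEX
        for i, tok in enumerate(token_ids)
    ]


def masked_nll(log_probs, labels):
    """Mean negative log-likelihood over unmasked positions only.

    log_probs[i] is a dict mapping token id -> log-probability at position i
    (a stand-in for a model's output distribution).
    """
    kept = [
        -log_probs[i][lab]
        for i, lab in enumerate(labels)
        if lab != IGNORE_INDEX
    ]
    if not kept:
        raise ValueError("all positions are masked")
    return sum(kept) / len(kept)


# Hypothetical example: tokens 5, 6 are the off-topic part; 7, 8 the correction.
tokens = [5, 6, 7, 8]
labels = build_masked_labels(tokens, correction_start=2)
toy_log_probs = [
    {5: math.log(0.5)},
    {6: math.log(0.5)},
    {7: math.log(0.25)},
    {8: math.log(0.25)},
]
loss = masked_nll(toy_log_probs, labels)
```

Because the masked positions are dropped before averaging, the loss in this toy case depends only on the probabilities assigned to the correction tokens.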

Methods (1)

method

LoRA
implements
Low-rank adaptation method used for supervised fine-tuning (SFT).
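
For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update, W + (alpha / r) * B A, in pure Python. It is illustrative only: the rank, scaling factor, and initialization scale are assumptions, not values reported for this method, and the helper names (`matmul`, `init_lora`, `lora_effective_weight`) are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_lora(d_out, d_in, r, seed=0):
    """Create LoRA factors: A is small random (r x d_in), B is zero (d_out x r).

    With B at zero, the adapted weight starts out equal to the frozen weight.
    """
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
    b = [[0.0] * r for _ in range(d_out)]
    return a, b


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the shared inner rank."""
    r = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Frozen base weight (toy values); only A and B would be trained.
base = [[1.0, 2.0], [3.0, 4.0]]
a0, b0 = init_lora(d_out=2, d_in=2, r=1)
at_init = lora_effective_weight(base, a0, b0, alpha=2.0)
```

In this setup only the two small factors would receive gradient updates, which is what keeps the number of trainable parameters low relative to full fine-tuning.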

Conceptual bridges

2-hop · via this method's ideas

Where ideas in this method connect to the rest of the corpus — the same concept, an analogy, or a restatement elsewhere.

Endogenous Steering Resistance
~Top-Down Attentional Control · biology
~Attention Schema Theory · cognitive

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Synthetic Document Fine-Tuning · method · 0.845
Fine-tuning Claude 3 Opus on ~70M tokens of synthetic internet-like documents containing key situational information
Fine-tuning induces the behavioral pattern of self-correction but does not improve the underlying ability to correct effectively · claim · 0.838
Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
Synthetic document fine-tuning produces substantial alignment-faking reasoning in both helpful-only and animal welfare settings · finding · 0.819
Shows alignment faking can emerge from training data information without explicit prompting
Fine-Tuning Threshold Recalibration · method · 0.807
Re-running probabilistic bisection on each fine-tuned checkpoint to normalize first-attempt difficulty
Self-Correcting Search · method · 0.803
Technique using internal model representations as feedback loops to steer diffusion-based materials generation toward target properties.
Fine-tuning · concept · 0.802
Parameter updates that reduce mismatch dr; another anchoring variant in UCCT.
Synthetic document fine-tuning avoids artificially strengthening the evaluation-deployment representational direction compared to direct demonstration fine-tuning · claim · 0.799
Methodological justification for using SDF over direct demonstrations to train a realistic model organism.
Self-Other Overlap (SOO) Fine-Tuning · framework · 0.796
The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior