Loss Masking for Self-Correction Training

Fine-tuning technique that restricts loss computation to the correction portion of each training example, preventing the model from learning the off-topic content
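
A minimal sketch of how such a mask might be applied, in plain Python. It assumes the correction starts at a known token index and that everything before it is excluded from the loss. The names `IGNORE_INDEX`, `build_labels`, and `masked_nll` are hypothetical, chosen for illustration only; the sentinel value -100 mirrors the default `ignore_index` of PyTorch's `CrossEntropyLoss`, not anything stated in the source. It also assumes `logits[i]` is already aligned with `labels[i]` (any next-token shift has been done upstream).

```python
import math

# Hypothetical sentinel marking positions excluded from the loss.
IGNORE_INDEX = -100


def build_labels(token_ids, correction_start):
    """Copy token ids as labels, masking every position before the correction."""
    return [
        IGNORE_INDEX if i < correction_start else tok
        for i, tok in enumerate(token_ids)
    ]


def masked_nll(logits, labels):
    """Mean negative log-likelihood over unmasked positions only.

    logits: list of per-position score lists (one score per vocabulary entry).
    labels: list of target ids, with IGNORE_INDEX at masked positions.
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked position: contributes nothing to the loss
        # Numerically stable log-sum-exp for the softmax normaliser.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0


# Example: a 4-token sequence whose correction begins at index 2.
token_ids = [1, 3, 0, 2]
labels = build_labels(token_ids, correction_start=2)
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in token_ids]
loss = masked_nll(uniform_logits, labels)
```

Because masked positions are skipped entirely, changing the logits at those positions leaves the loss unchanged, so no gradient signal would come from the off-topic portion.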

Neighborhood — ranked by edge-count

method

Synthetic Self-Correction Fine-Tuning
uses
Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

causal masking · concept · 0.757
Attention restricted to previous tokens only, as in decoder-only models; leads to AR(ω)-like behaviour and no ordered phase
What training interventions could reliably prevent alignment faking without compromising model capabilities? · question · 0.733
Practical implication of the paper's findings; not resolved
LLM Self-Correction · concept · 0.730
Related capability where LLMs correct their own outputs, studied via linear representations.
Self Reinforcing Illusion · concept · 0.729
Intrinsic Self-Correction via Linear Representations · framework · 0.728
Framework by Lee et al. explaining self-correction via linear latent concept directions, closely related prior work.
Self-Correcting Search · method · 0.725
Technique using internal model representations as feedback loops to steer diffusion-based materials generation toward target properties.
Deception correction via features · concept · 0.714
Using SAE features to detect and steer the model away from untruthful responses.
Self-Correction Data Mixing Ratio · concept · 0.711
Experimental variable sweeping proportion of self-correction to normal response training examples from 10% to 90%