concept
active
concept:loss-masking-for-self-correction-training
Loss Masking for Self-Correction Training
Fine-tuning technique that restricts loss computation to the correction portion of each training example, so the model is not trained to produce the off-topic content.
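A minimal sketch of the idea, not the implementation used in the source work. It assumes the logits are already aligned with the labels (row `i` scores the token at position `i`), that the correction span starts at a known index `correction_start`, and that masked positions carry the label `-100`, the default ignore index in common training libraries. The function names are hypothetical.

```python
import math

IGNORE_INDEX = -100  # assumed "skip this position" label value


def build_labels(token_ids, correction_start):
    """Copy token ids as labels, masking every position before the correction."""
    return [IGNORE_INDEX if i < correction_start else tok
            for i, tok in enumerate(token_ids)]


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over unmasked positions only."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # off-topic prefix: contributes no loss and no gradient
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0


# Toy example: 4 tokens, vocabulary of 10, correction starts at position 2.
token_ids = [5, 6, 7, 8]
labels = build_labels(token_ids, correction_start=2)
logits = [[0.0] * 10 for _ in token_ids]
loss = masked_cross_entropy(logits, labels)
```

Because the masked positions are skipped entirely, changing the logits at those positions leaves the loss unchanged. The model still conditions on the off-topic prefix as context, but it is only rewarded for predicting the correction tokens.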
Neighborhood — ranked by edge-count
Methods (1)
method
- Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Attention restricted to previous tokens only, as in decoder-only models; leads to AR(ω)-like behaviour and no ordered phase
- What training interventions could reliably prevent alignment faking without compromising model capabilities? (question, 0.733) Practical implication of the paper's findings; not resolved.
- Related capability where LLMs correct their own outputs, studied via linear representations.
- Framework by Lee et al. explaining self-correction via linear latent concept directions, closely related prior work.
- Technique using internal model representations as feedback loops to steer diffusion-based materials generation toward target properties.
- Using SAE features to detect and steer the model away from untruthful responses.
- Experimental variable sweeping proportion of self-correction to normal response training examples from 10% to 90%
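The last entry above, the sweep over the proportion of self-correction examples, can be sketched as follows. This is an illustrative assumption about how such a mixture might be built, not the procedure from the source: the helper `mix_examples`, the placeholder pools, the fixed dataset size, and the 20-point step between ratios are all hypothetical (the entry only states the 10% to 90% range).

```python
import random


def mix_examples(self_correction, normal, ratio, total, seed=0):
    """Draw `total` examples, with a `ratio` fraction taken from self-correction data.

    Sampling is without replacement, so each pool must hold at least as many
    examples as are requested from it (random.sample raises ValueError otherwise).
    """
    rng = random.Random(seed)
    n_sc = round(ratio * total)
    n_normal = total - n_sc
    mixed = rng.sample(self_correction, n_sc) + rng.sample(normal, n_normal)
    rng.shuffle(mixed)  # interleave the two sources
    return mixed


# Placeholder pools; real examples would be prompt/response records.
sc_pool = [("self_correction", i) for i in range(100)]
normal_pool = [("normal", i) for i in range(100)]

# Hypothetical sweep grid covering the stated 10% to 90% range.
sweep = {r: mix_examples(sc_pool, normal_pool, r, total=100)
         for r in (0.1, 0.3, 0.5, 0.7, 0.9)}
```

Holding `total` fixed across the sweep keeps the amount of training data constant, so differences between runs can be attributed to the mixture ratio rather than to dataset size.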