claim
active
claim:fine-tuning-induces-the-behavioral-pattern-of-self-correction-but-does-not-improve-the-underlying-ability-to-correct-effectively
Fine-tuning induces the behavioral pattern of self-correction but does not improve the underlying ability to correct effectively
Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
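The dissociation can be made concrete with a minimal sketch. The definitions below are assumptions for illustration, not the source paper's metrics: attempt rate is taken as the share of episodes in which the model tries to revise its answer, and improvement rate as the share of those attempts that raise the score. The `Episode` type, both function names, and all numbers are hypothetical toy values, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical record of one evaluation episode (illustrative only).
    attempted_correction: bool  # did the model try to revise its answer?
    score_before: float         # score of the initial answer
    score_after: float          # score after any revision


def attempt_rate(episodes):
    """Share of episodes in which a self-correction was attempted."""
    if not episodes:
        return 0.0
    return sum(e.attempted_correction for e in episodes) / len(episodes)


def improvement_rate(episodes):
    """Share of attempted corrections that raised the score."""
    attempts = [e for e in episodes if e.attempted_correction]
    if not attempts:
        return 0.0
    return sum(e.score_after > e.score_before for e in attempts) / len(attempts)


# Synthetic toy data: the "tuned" model attempts far more often,
# but its attempts succeed no more often than the "base" model's.
base = (
    [Episode(True, 0.4, 0.6), Episode(True, 0.4, 0.4)]
    + [Episode(False, 0.4, 0.4)] * 8
)
tuned = (
    [Episode(True, 0.4, 0.6)] * 4
    + [Episode(True, 0.4, 0.3)] * 4
    + [Episode(False, 0.4, 0.4)] * 2
)

print(attempt_rate(base), improvement_rate(base))
print(attempt_rate(tuned), improvement_rate(tuned))
```

In this toy setup the attempt rate rises while the improvement rate stays flat, which is the shape of result the claim describes. Conditioning the improvement rate on attempts, rather than on all episodes, keeps a higher attempt count from inflating it on its own.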
Source paper
extracted_from · (2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (1)
finding
- Shows behavioral pattern of self-correction is trainable in smaller models
Concepts (1)
concept
- Key finding pattern where fine-tuning increases attempt rate but not correction success rate
Claims (1)
claim
- Interpretive conclusion linking the fine-tuning dissociation to broader questions about model metacognition
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior
- Future work hypothesis about extending SOO to direct value alignment
- Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
- UCCT's theoretical prediction about how fine-tuning maps onto the anchoring score
- Central empirical claim of the paper supported by three LLM experiments
- Technique used to impose guardrails on base LLMs, analogized to censorship on the simulator's range of simulacra
- Normative-scientific claim about the alignment implications of Experiment 2's findings
- Unified interpretation of different adaptation methods via UCCT terms