Intrinsic Self-Correction via Linear Representations

Framework by Lee et al. that explains intrinsic self-correction in terms of linear latent concept directions; closely related prior work.

Neighborhood — ranked by edge-count

thinker

Yu-Ting Lee (self-correction paper)
introduces
Author of the paper on intrinsic self-correction via linear representations; closely related work by the same group.

framework

Linear Representation Hypothesis
implements
The hypothesis that models internalize concepts as approximately linear directions in representation space; used here to interpret MDS injection behavior.
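
A minimal sketch of what "a concept as an approximately linear direction" can mean in practice. It estimates a direction as the normalized difference of mean activations between examples with and without a concept, then reads the concept out by projection and shifts an activation along it. Difference-of-means is one common estimator and is an assumption here, not necessarily the method used by Lee et al.; the function names and the 2-dimensional toy activations are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_direction(with_concept, without_concept):
    """Unit vector pointing from the 'without' mean to the 'with' mean (assumed estimator)."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("the two groups have identical means; no direction to estimate")
    return [d / norm for d in diff]

def project(activation, direction):
    """Scalar readout: how far the activation lies along the concept direction."""
    return sum(a * d for a, d in zip(activation, direction))

def steer(activation, direction, alpha):
    """Shift an activation by alpha units along the concept direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Hypothetical toy activations: the concept varies along the first axis only.
pos = [[2.0, 0.1], [2.2, -0.1]]
neg = [[0.0, 0.1], [0.2, -0.1]]

direction = concept_direction(pos, neg)
score_pos = project(pos[0], direction)
score_neg = project(neg[0], direction)
steered = steer(neg[0], direction, 2.0)
```

In this toy setup the projection separates the two groups, and steering a "without" activation along the direction moves its readout toward the "with" group. Real model activations are high-dimensional and only approximately linear, so this is an illustration of the idea rather than an analysis recipe.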

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLM Self-Correction · concept · 0.785
Related capability where LLMs correct their own outputs, studied via linear representations.
What is the full computational pathway underlying self-correction across multiple layers? · question · 0.780
Mechanistic question requiring multi-layer SAE analysis beyond the current single-layer approach.
Synthetic Self-Correction Fine-Tuning · method · 0.775
Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior.
Self-Correction Data Mixing Ratio · concept · 0.766
Experimental variable sweeping the proportion of self-correction to normal-response training examples from 10% to 90%.
What are the invariants that enable a Self to persist and remain recognizable despite drastic biological remodeling? · question · 0.755
Drives investigation of memory, identity, and continuity across metamorphosis and regeneration.
Linear representation · concept · 0.754
The idea that features are encoded as directions in activation space.
Recognition of self-generated outputs · concept · 0.754
Ability to distinguish one's own outputs from those of other models or humans; related to prefill detection.
Feedback and Correction · concept · 0.754
The adaptive, incremental nature of living process, allowing small steps with continuous evaluation and adjustment.
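
The untyped-neighbor listing above (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter over entity embeddings. This is an illustrative sketch only: the embedding vectors, the entity names, and the function `untyped_neighbors` are hypothetical, and how the actual listing is computed is not stated in the source. Only the 0.65 threshold and the descending order of scores are taken from the listing itself.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities similar to `target` that have no typed edge to it, highest score first."""
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and anything already linked by a typed edge
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Hypothetical 2-dimensional embeddings, for illustration only.
embeddings = {
    "target": [1.0, 0.0],
    "A": [1.0, 0.5],   # similar and untyped: a candidate
    "B": [0.0, 1.0],   # orthogonal: below the threshold
    "C": [1.0, 0.1],   # similar, but already has a typed edge
}
typed_edges = {"C"}

candidates = untyped_neighbors("target", embeddings, typed_edges)
```

Entries that survive the filter are the candidates for new edges or duplicate checks; a high score alone does not imply a real relation, as the off-topic items in the listing show.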