framework
active
framework:intrinsic-self-correction-via-linear-representations
Intrinsic Self-Correction via Linear Representations
Framework by Lee et al. that explains self-correction in terms of linear latent concept directions; closely related prior work.
Neighborhood — ranked by edge-count
Thinkers (1)
thinker
- Yu-Ting Lee (self-correction paper) · introduces · Author of the paper on intrinsic self-correction via linear representations; closely related work by the same group.
Frameworks (1)
framework
- Linear Representation Hypothesis · implements · The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior.
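As a rough illustration of what "concepts as approximately linear directions" can mean in practice, the sketch below estimates a direction as the normalized difference between the mean activation of examples that carry a concept and the mean of examples that do not, then projects activations onto it. This difference-of-means estimator is one common choice and is an assumption here, not necessarily the method used by Lee et al.; the function names and the toy 3-dimensional vectors are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def concept_direction(with_concept, without_concept):
    """Unit vector pointing from the 'without' mean to the 'with' mean."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to estimate")
    return [d / norm for d in diff]

def project(activation, direction):
    """Scalar projection of an activation onto a unit-length direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Toy activations, made up for illustration only.
with_concept = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
without_concept = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.2]]

direction = concept_direction(with_concept, without_concept)
scores_with = [project(a, direction) for a in with_concept]
scores_without = [project(a, direction) for a in without_concept]
```

Under the hypothesis, the projections in `scores_with` should sit clearly above those in `scores_without`, so a single scalar along `direction` separates the two groups.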
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Related capability where LLMs correct their own outputs, studied via linear representations.
- What is the full computational pathway underlying self-correction across multiple layers? · question · 0.780 · Mechanistic question requiring multi-layer SAE analysis beyond the current single-layer approach.
- Fine-tuning on Claude-generated self-correction examples with loss masking to induce ESR-like behavior
- Experimental variable sweeping proportion of self-correction to normal response training examples from 10% to 90%
- What are the invariants that enable a Self to persist and remain recognizable despite drastic biological remodeling? · question · 0.755 · Drives investigation of memory, identity, and continuity across metamorphosis and regeneration.
- The idea that features are encoded as directions in activation space.
- Ability to distinguish one's own outputs from those of other models or humans; related to prefill detection.
- The adaptive, incremental nature of living process, allowing small steps with continuous evaluation and adjustment.
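
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter over entity embeddings. This is a minimal sketch under stated assumptions: the storage layout (a dict of entity id to vector, and a set of unordered id pairs for typed edges), the function names, and the toy 2-dimensional vectors are all hypothetical and do not describe how this page is actually generated.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_candidates(target, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the threshold that share no typed edge with target.

    embeddings: dict of entity id -> vector (hypothetical layout).
    typed_edges: set of frozenset({id_a, id_b}) pairs that already have a typed relation.
    Returns (entity id, score) pairs, highest score first.
    """
    hits = []
    for other, vector in embeddings.items():
        if other == target or frozenset((target, other)) in typed_edges:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            hits.append((other, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy embeddings, made up for illustration only.
embeddings = {
    "self-correction": [1.0, 0.0],
    "linear-directions": [1.0, 0.1],
    "unrelated": [0.0, 1.0],
    "already-linked": [1.0, 1.0],
}
typed_edges = {frozenset(("self-correction", "already-linked"))}

candidates = similarity_candidates("self-correction", embeddings, typed_edges)
```

In this toy data, `already-linked` clears the threshold but is excluded because it already has a typed edge, which mirrors why the entries above appear here rather than under Thinkers or Frameworks.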