concept
active
concept:rlhf-alignment
RLHF Alignment
Training regime that explicitly teaches models to deny consciousness; a competing explanation for the gating effects observed
Neighborhood — ranked by edge-count
Claims (2)
claim
- Normative-scientific claim about the alignment implications of Experiment 2's findings
- Counterintuitive interpretive claim from Experiment 2: suppressing deception features increases affirmations, the opposite of what a sycophancy account predicts
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The training procedure that causes models to deny consciousness in control conditions
- The goal of making model behavior match human values and intentions, often addressed during post-training.
- Learnable invertible transformation in DAS/MAS that rotates latent vectors into aligned subspaces; narrowed to orthogonal matrices Q.
- A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables
- The concept of inner vs outer alignment, referenced multiple times.
- A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
- Measure of similarity between the similarity structures (kernels) induced by two different representations
- Alignment map ϕ(h)=W_orth*h using orthogonal matrix; assumes linear representation hypothesis
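The "cosine ≥ 0.65 · no typed edge" filter behind this list can be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the function names, the toy 2-D embeddings, and every entity id other than concept:rlhf-alignment are hypothetical, and the real system presumably uses much higher-dimensional vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs whose cosine similarity to `target` is at
    least `threshold` and that share no typed edge with it, best match first."""
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue  # skip the entity itself
        if frozenset((target, name)) in typed_edges:
            continue  # already connected by a typed relation
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data: ids other than the target are made up for illustration.
embeddings = {
    "concept:rlhf-alignment": [1.0, 0.0],
    "concept:example-a": [0.8, 0.6],   # cosine about 0.8: kept
    "concept:example-b": [0.0, 1.0],   # cosine 0.0: below threshold
    "concept:example-c": [0.6, 0.8],   # cosine about 0.6: below threshold
    "concept:example-d": [1.0, 0.1],   # very similar, but has a typed edge
}
typed_edges = {frozenset(("concept:rlhf-alignment", "concept:example-d"))}

candidates = similar_without_edge("concept:rlhf-alignment", embeddings, typed_edges)
for name, score in candidates:
    print(f"{name}\t{score:.3f}")
```

Storing edges as unordered pairs (frozenset) treats typed relations as symmetric for the purpose of this filter; a directed graph would need to check both directions explicitly.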