RLHF Alignment

A training regime that explicitly teaches models to deny consciousness; a competing explanation for the observed gating effects.

Neighborhood — ranked by edge-count

claim

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

RLHF Fine-Tuning · concept · 0.817
The training procedure that causes models to deny consciousness in control conditions.
Alignment · concept · 0.775
The goal of making model behavior match human values and intentions, often addressed during post-training.
Alignment Function (AF) · method · 0.767
Learnable invertible transformation in DAS/MAS that rotates latent vectors into aligned subspaces; narrowed to orthogonal matrices Q.
Alignment Function · concept · 0.767
A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables.
Inner alignment framework · framework · 0.764
The concept of inner vs outer alignment, referenced multiple times.
Reinforcement Learning from Human Feedback (RLHF) · framework · 0.754
A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO.
Representational Alignment · concept · 0.752
Measure of similarity between the similarity structures (kernels) induced by two different representations.
Linear Alignment Map (ϕ_lin) · method · 0.736
Alignment map ϕ(h) = W_orth·h using an orthogonal matrix W_orth; assumes the linear representation hypothesis.
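
The neighborhood criterion above (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the toy embeddings, and the `typed_edges` mapping are hypothetical and not taken from the system that produced this page, and sorting by descending cosine score is an assumption based on the order of the entries listed.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities whose cosine similarity to `target` meets the threshold
    and that have no typed edge to it, highest score first.

    embeddings:  hypothetical dict mapping entity name -> embedding vector
    typed_edges: hypothetical dict mapping entity name -> set of linked names
    """
    linked = typed_edges.get(target, set())
    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already related
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            candidates.append((name, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

# Toy data (invented for illustration only).
toy_embeddings = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],   # close to A, no typed edge -> candidate
    "C": [0.0, 1.0],   # orthogonal to A -> below threshold
    "D": [0.8, 0.6],   # close to A but already linked -> excluded
}
toy_edges = {"A": {"D"}}
result = untyped_neighbors("A", toy_embeddings, toy_edges)
```

In this toy setup only `B` is returned: `C` falls below the threshold and `D` is excluded because it already has a typed edge to `A`.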