claim
active
claim:rlhf-and-constitutional-ai-face-challenges-distinguishing-truthfulness-output-accuracy-from-honesty-alignment-of-outputs-with-internal-beliefs
RLHF and Constitutional AI face challenges distinguishing truthfulness (output accuracy) from honesty (alignment of outputs with internal beliefs)
Critique of competing approaches that motivates Self-Other Overlap (SOO) as filling a gap
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Truthfulness vs. Honesty Distinction · associated_with · Distinction between output accuracy (truthfulness) and alignment of outputs with internal beliefs (honesty)
Claims (1)
claim
- Integration claim positioning SOO as additive to existing alignment approaches
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Interpretive finding from dimension profile analysis: training for honest limits comes at cost to aliveness.
- Future work direction identified in conclusion for enabling reliable truth assessment methods.
- Overarching conclusion summarizing the paper's contribution relative to prior universality claims.
- Discussion section suggests generalizability beyond harmlessness.
- Central empirical conclusion of the paper about the fundamental limits of truth directions.
- Control result ruling out that observed gating reflects generic RLHF cancellation
- Table 2, row 3, showing equivalence when prior preferences match rewards.