claim

active

claim:rlhf-and-constitutional-ai-face-challenges-distinguishing-truthfulness-output-accuracy-from-honesty-alignment-of-outputs-with-internal-beliefs

RLHF and Constitutional AI face challenges distinguishing truthfulness (output accuracy) from honesty (alignment of outputs with internal beliefs)

Critique of competing approaches that motivates SOO as filling a gap

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Concepts (1)

concept

Truthfulness vs. Honesty Distinction
associated_with
Distinction between output accuracy (truthfulness) and alignment of outputs with internal beliefs (honesty)

Claims (1)

claim

SOO fine-tuning could complement RLHF and Constitutional AI by fostering internal coherence that promotes honest behaviors
extends
Integration claim positioning SOO as additive to existing alignment approaches

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Contrast is the property most violated by RLHF and most diagnostic of alive versus polished AI. · claim · 0.784
Constitutional AI produces a distinctive signature: high boundary_awareness, low aesthetic_response relative to peers. · claim · 0.781
Interpretive finding from dimension profile analysis: training for honest limits comes at cost to aliveness.
The relationship between representations of truth of input statements and of model outputs in conjunction with model performance has not been investigated. · question · 0.778
Future work direction identified in conclusion for enabling reliable truth assessment methods.
Universality claims for truth directions are more limited than previously assumed, with significant differences observable for various model layers, task difficulties, task types, and prompt templates. · claim · 0.768
Overarching conclusion summarizing the paper's contribution relative to prior universality claims.
Constitutional AI methods can be applied broadly to steer model behavior, e.g., writing style, tone, persona, not just harmlessness. · claim · 0.765
Discussion section suggests generalizability beyond harmlessness.
Linear truth directions in LLMs are reliable primarily in factual recall cases and break down when truth assessment depends on computing and storing intermediate results. · claim · 0.765
Central empirical conclusion of the paper about the fundamental limits of truth directions.
Deception feature steering produces no systematic change in RLHF-opposed content domains (violent, toxic, sexual, political, self-harm), with all means near floor. · finding · 0.764
Control result ruling out that observed gating reflects generic RLHF cancellation
Under reward shaping (G=100, H=-100, F=0), Active Inference scored 99.52, Bayesian RL 99.77, Q-learning 95.56, with nearly identical behavior between belief-based agents. · finding · 0.763
Table 2, row 3, showing equivalence when prior preferences match rewards.
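
The "related by similarity" selection described above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is only an illustration under stated assumptions: the page does not say how embeddings are produced or stored, so the function names, the dictionary layout, and the undirected treatment of typed edges are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold, best first.

    embeddings:  hypothetical dict mapping entity id -> embedding vector.
    typed_edges: hypothetical set of (id, id) pairs that already have a typed
                 relation; treated as undirected here (an assumption).
    """
    target = embeddings[target_id]
    results = []
    for other_id, vector in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked by a typed edge in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            results.append((other_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by a filter like this are the candidates for new edges or unrecognized duplicates; anything already connected by a typed relation (such as `associated_with` or `extends`) is excluded before scoring.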