finding

active

finding:using-soft-preference-labels-normalized-log-probabilities-for-rl-cai-without-cot-leads-to-better-results-than-hard-labels-0-1

Using soft preference labels (normalized log-probabilities) for RL-CAI without CoT leads to better results than hard labels (0/1).

Section 4.3 reports that the soft labels are well-calibrated, which it offers as the likely reason they outperform hard labels.
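
To make the soft-versus-hard distinction concrete, here is a minimal sketch, not the paper's implementation. It assumes the feedback model returns one log-probability per answer option for a response pair. The function names (`soft_label`, `hard_label`, `preference_loss`) are hypothetical, as is the use of a sigmoid cross-entropy loss for the preference model.

```python
import math


def soft_label(logp_a: float, logp_b: float) -> float:
    """Soft preference target: P(A preferred), obtained by normalizing the
    two option log-probabilities so they sum to 1."""
    m = max(logp_a, logp_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logp_a - m), math.exp(logp_b - m)
    return ea / (ea + eb)


def hard_label(logp_a: float, logp_b: float) -> float:
    """Hard preference target: 1 if A is the more likely option, else 0."""
    return 1.0 if logp_a >= logp_b else 0.0


def preference_loss(score_a: float, score_b: float, target: float) -> float:
    """Cross-entropy between the target and sigmoid(score_a - score_b).
    Assumption: the preference model assigns each response a scalar score."""
    p = 1.0 / (1.0 + math.exp(-(score_a - score_b)))
    eps = 1e-12  # guards against log(0)
    return -(target * math.log(p + eps) + (1.0 - target) * math.log(1.0 - p + eps))


# The feedback model puts mass 0.3 on option A and 0.2 on option B.
logp_a, logp_b = math.log(0.3), math.log(0.2)
soft = soft_label(logp_a, logp_b)  # 0.3 / (0.3 + 0.2) = 0.6
hard = hard_label(logp_a, logp_b)  # collapses the same evidence to 1.0
```

In this example the hard label discards the fact that the feedback model was only mildly confident, whereas the soft label passes that uncertainty through to the preference-model target.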

Source paper

extracted_from

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

(2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Clamping CoT probabilities to 40-60% range for RL-CAI with CoT improves robustness and reduces extreme responses. · finding · 0.835
Section 4.3 describes clamping at 40-60% led to better behavior than clamping at 20-80%.
RL-CAI labels are reasonably well-calibrated on the new HHH evaluation, with frequencies aligning with predicted probabilities. · finding · 0.818
Figure 9 calibration plot shows good alignment.
RL-CAI with CoT shows a Pareto improvement in helpfulness-harmlessness tradeoff over standard RLHF, with slight helpfulness decrease but higher harmlessness. · finding · 0.816
Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.
RL-CAI models (with and without CoT) are rated more harmless by crowdworkers than HH RLHF and SL-CAI. · finding · 0.813
From Figure 3 and Figure 8, RL-CAI achieves significantly higher harmlessness Elo scores.
Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful. · finding · 0.789
Figure 10: solid lines are sampled at temperature T=1 and dashed lines at T=0; the helpful RLHF harmfulness score rises while the others fall.
RL-CAI models are virtually never evasive and often give nuanced harmless responses, whereas HH RLHF models tend to be evasive. · finding · 0.755
Section 4.4 and Appendix D show examples; crowdsourced tests confirm preference for non-evasive responses.
In the absence of prior preferences, Active Inference null model and Bayesian RL maintain exploration with average scores of 44.00 and 39.94 respectively, whereas Q-learning does not explore. · finding · 0.752
Table 2 first row; reward shaping section.
SL-CAI models achieve higher harmlessness Elo than pretrained models and helpful RLHF, but lower than HH RLHF. · finding · 0.736
From Figure 3, SL-CAI is more harmless than pretrained and helpful RLHF, less harmless than HH RLHF.