finding
active
finding:using-soft-preference-labels-normalized-log-probabilities-for-rl-cai-without-cot-leads-to-better-results-than-hard-labels-0-1
Using soft preference labels (normalized log-probabilities) for RL-CAI without CoT leads to better results than hard labels (0/1).
Section 4.3 reports that soft labels are well calibrated and improve performance.
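The difference between the two labeling schemes can be illustrated with a minimal sketch. It assumes a feedback model that returns one log-probability per response option; the function names and the example values are hypothetical and are not taken from the source paper. The optional clamp uses the 40-60 percent range mentioned in the related entry on clamping as its default, which is an assumption about how such a clamp might be applied.

```python
import math


def soft_preference_label(logp_a: float, logp_b: float) -> float:
    """Normalize two option log-probabilities into P(option A is preferred).

    Subtracting the larger value first keeps exp() from underflowing
    when both log-probabilities are very negative.
    """
    m = max(logp_a, logp_b)
    ea = math.exp(logp_a - m)
    eb = math.exp(logp_b - m)
    return ea / (ea + eb)


def hard_preference_label(logp_a: float, logp_b: float) -> float:
    """Collapse the same comparison into a 0/1 target (ties go to A)."""
    return 1.0 if logp_a >= logp_b else 0.0


def clamp_label(p: float, low: float = 0.4, high: float = 0.6) -> float:
    """Restrict a soft label to [low, high]; the defaults are an assumption."""
    return min(max(p, low), high)


# Hypothetical feedback-model outputs for responses A and B.
logp_a, logp_b = math.log(0.3), math.log(0.1)

soft = soft_preference_label(logp_a, logp_b)  # keeps the model's confidence
hard = hard_preference_label(logp_a, logp_b)  # discards it
clamped = clamp_label(soft)
```

The soft label preserves how strongly the feedback model prefers one response, while the hard label treats a marginal preference and an overwhelming one identically.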
Source paper
extracted_from · (2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Section 4.3 describes clamping at 40-60 led to better behavior than clamping at 20-80.
- Figure 9 calibration plot shows good alignment.
- Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.
- RL-CAI models (with and without CoT) are rated more harmless by crowdworkers than HH RLHF and SL-CAI. (finding · 0.813) From Figure 3 and Figure 8, RL-CAI achieves significantly higher harmlessness Elo scores.
- Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
- Section 4.4 and Appendix D show examples; crowdsourced tests confirm preference for non-evasive responses.
- Table 2 first row; reward shaping section.
- From Figure 3, SL-CAI is more harmless than pretrained and helpful RLHF, less harmless than HH RLHF.