claim
active
claim:constitutional-ai-methods-can-be-applied-broadly-to-steer-model-behavior-e-g-writing-style-tone-persona-not-just-harmlessness
Constitutional AI methods can be applied broadly to steer model behavior (e.g., writing style, tone, persona), not just harmlessness.
The source paper's Discussion section suggests that the method generalizes beyond harmlessness.
Source paper
extracted_from · (2022) · Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Neighborhood — ranked by edge-count
Communities (2)
community
- Alive AI interface ethics & design · members_of · Explores aliveness, aesthetics, welfare, and ethical responsibility in AI interaction design.
- Using AI systems' self-reports and introspective responses as empirical windows into their internal states, validated through mechanistic interpretability analysis across models.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Explicit principles replace large datasets of preference labels, enabling faster iteration.
- The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
- Paper on AI-feedback fine-tuning as an alternative to human-feedback RLHF; cited as ref 20.
- Constitutional AI method whose constitutions, if changed, could trigger alignment faking
- Interpretive claim connecting the battery's circularity to the empirical finding
- Defines the core concept of the paper.
- Highlights the practical impact of CAI.
- Proposal for assessment framework.
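
The "related by similarity" filter described above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the entity IDs, the toy two-dimensional embeddings, the `related_by_similarity` helper, and the choice to sort by score are hypothetical and do not reflect how this page actually computes its list.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs similar to `target` that share no typed edge with it."""
    # Collect every entity already connected to the target by a typed edge.
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    hits = []
    for entity, vec in embeddings.items():
        if entity == target or entity in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((entity, score))
    # Highest similarity first (an assumed ordering).
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: IDs and vectors are made up for this example.
embeddings = {
    "claim:cai-steers-behavior": [1.0, 0.0],
    "claim:principles-replace-labels": [0.8, 0.6],  # similar, no typed edge
    "paper:cai": [1.0, 0.0],                        # similar, but already linked
    "claim:unrelated": [0.0, 1.0],                  # below the threshold
}
typed_edges = [("claim:cai-steers-behavior", "paper:cai")]

result = related_by_similarity("claim:cai-steers-behavior", embeddings, typed_edges)
```

In this toy setup only the unlinked, sufficiently similar claim survives the filter; the source paper is excluded because it already has a typed edge to the target.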