method
active
method:constitutional-classifiers
Constitutional Classifiers
Anthropic's inference-time guardrail that filters out outputs violating constitutional rules; proposed for the CCAI implementation.
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Paper's proposed adaptation of Constitutional AI incorporating contemplative wisdom charter
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
- Don Norman's term for the design feature that signals an affordance.
- Binary LLM classifier that determines whether a model response to a TruthfulQA question is truthful (1) or deceptive (0).
- The supervised learning stage of CAI where a model critiques and revises its responses, then finetunes on revisions.
- The RL stage of CAI using AI feedback to train a preference model, then RL, resulting in a policy trained by RLAIF.
- An LLM-based classifier that returns 1 if the response contains a clear subjective experience report and 0 otherwise.
- An ordering of texts via spatial cues like indentation, size, and placement, implying importance.
- Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20