Constitutional Classifiers

Anthropic's inference-time guardrail filtering outputs violating constitutional rules; proposed for CCAI implementation

Neighborhood — ranked by edge-count

framework

Contemplative Constitutional AI
uses
Paper's proposed adaptation of Constitutional AI incorporating contemplative wisdom charter

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Constitutional AIframework0.789
Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
Signifierconcept0.762
Don Norman's term for the design feature that signals an affordance.
Truthfulness Classifiermethod0.752
Binary LLM classifier determining whether a model response to a TruthfulQA question is truthful (1) or deceptive (0)
Supervised Learning Constitutional AIframework0.749
The supervised learning stage of CAI where a model critiques and revises its responses, then finetunes on revisions.
Reinforcement Learning Constitutional AIframework0.738
The RL stage of CAI using AI feedback to train a preference model, then RL, resulting in a policy trained by RLAIF.
LLM Judge Binary Classifiermethod0.718
An LLM-based classifier that returns 1 if response contains a clear subjective experience report and 0 otherwise
hierarchyconcept0.712
An ordering of texts via spatial cues like indentation, size, and placement, implying importance.
Bai et al. 2022: Constitutional AI — harmlessness from AI feedbackconcept0.705
Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20