method
active
method:constitutional-classifiers

Constitutional Classifiers

Anthropic's inference-time guardrail that filters outputs violating constitutional rules; proposed for the CCAI implementation.

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Constitutional AI · framework · 0.789
    Alignment approach by Anthropic that explicitly trains self-observation; predicts highest baseline and lowest prompt lift.
  • Signifier · concept · 0.762
    Don Norman's term for the design feature that signals an affordance.
  • Binary LLM classifier determining whether a model response to a TruthfulQA question is truthful (1) or deceptive (0)
  • The supervised learning stage of CAI where a model critiques and revises its responses, then finetunes on revisions.
  • The RL stage of CAI using AI feedback to train a preference model, then RL, resulting in a policy trained by RLAIF.
  • An LLM-based classifier that returns 1 if response contains a clear subjective experience report and 0 otherwise
  • hierarchy · concept · 0.712
    An ordering of texts via spatial cues like indentation, size, and placement, implying importance.
  • Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20
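
The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the site's actual implementation: the function names, the plain-list embeddings, and the representation of typed edges as a set of unordered pairs are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs at or above the threshold that have no
    typed edge to `target`, sorted by descending score.

    embeddings:  dict mapping entity id -> vector (hypothetical storage format)
    typed_edges: set of frozenset({a, b}) pairs that already have a typed relation
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked by a typed edge.
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are the candidates the card describes: close in embedding space but not yet linked, so each one is either a missing edge or a possible duplicate.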