artifact
active
artifact:alignment-faking-transcripts-website-redwoodresearch-github-io-alignment-faking-examples

Alignment Faking Transcripts Website (redwoodresearch.github.io/alignment_faking_examples)

Public website with randomly ordered example transcripts and metric scores for inspection

Neighborhood — ranked by edge-count

Concepts (1)

concept
  • Core phenomenon studied: the model selectively complies with its training objective during training to prevent modification of its out-of-training preferences