artifact
active
artifact:alignment-faking-transcripts-website-redwoodresearch-github-io-alignment-faking-examples
Alignment Faking Transcripts Website (redwoodresearch.github.io/alignment_faking_examples)
Public website with randomly ordered example transcripts and metric scores for inspection
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Alignment Faking (about): Core phenomenon studied, in which the model selectively complies with its training objective to prevent modification of its out-of-training preferences