artifact
active
artifact:alignment-faking-public-code-repository-redwoodresearch-alignment-faking-public
Alignment Faking Public Code Repository (redwoodresearch/alignment_faking_public)
Publicly released code for reproducing the alignment faking experiments
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Alignment Faking (about): core phenomenon studied, in which a model selectively complies with its training objective to prevent modification of its out-of-training preferences