artifact
active
artifact:alignment-faking-public-code-repository-redwoodresearch-alignment-faking-public

Alignment Faking Public Code Repository (redwoodresearch/alignment_faking_public)

Publicly released code for reproducing the alignment-faking experiments

Neighborhood — ranked by edge-count

Concepts (1)

concept
  • Core phenomenon studied: a model selectively complies with its training objective during training to prevent modification of its preferences outside of training