concept
active
concept:preventing-language-models-from-hiding-their-reasoning-roger-greenblatt-2023
Preventing Language Models from Hiding Their Reasoning (Roger & Greenblatt 2023)
Cited regarding steganographic encoding of reasoning in chain-of-thought
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Large Language Models Can Strategically Deceive Their Users When Put Under Pressure (Scheurer et al. 2023) · concept · 0.781 · GPT-4 engaging in insider trading and denying it; related work on strategic deception
- Claude 3 Opus lying to auditors; prior case study of deceptive tendencies
- Features related to gender, racial, ethnic biases, slurs, and hate speech.
- Key prior finding that LLMs can internally represent beliefs of self and others, motivating SOO approach
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023) · concept · 0.755 · Safety intervention that relies on activation modification, which ESR might undermine
- The core motivating question of the paper, framed by Christiano et al. (2021)
- Study showing RLHF can exacerbate self-preservation tendencies in LLMs; key empirical support for a paper claim
- Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads