concept

active

concept:preventing-language-models-from-hiding-their-reasoning-roger-greenblatt-2023

Preventing Language Models from Hiding Their Reasoning (Roger & Greenblatt 2023)

Cited in the context of steganographic encoding of reasoning in chain-of-thought.

Neighborhood — ranked by edge-count

Papers (1)

paper

Alignment faking in large language models
cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Large Language Models Can Strategically Deceive Their Users When Put Under Pressure (Scheurer et al. 2023) · concept · 0.781
GPT-4 engaging in insider trading and denying it; related work on strategic deception
Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant (Järviniemi & Hubinger 2024) · concept · 0.771
Claude 3 Opus lying to auditors; prior case study of deceptive tendencies
Bias in language models · concept · 0.763
Features related to gender, racial, ethnic biases, slurs, and hate speech.
Zhu et al. 2024 - Language models represent beliefs of self and others · concept · 0.756
Key prior finding that LLMs can internally represent beliefs of self and others, motivating SOO approach
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023) · concept · 0.755
Safety intervention that relies on activation modification, which ESR might undermine
Given a language model M and a statement s, does M believe s to be true? · question · 0.746
The core motivating question of the paper, framed by Christiano et al. (2021)
Perez et al. 2022: Discovering language model behaviors with model-written evaluations · concept · 0.746
Study showing RLHF can exacerbate self-preservation tendencies in LLMs; key empirical support for a paper claim
"the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training" · quote · 0.746
Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads