concept

active

concept:sycophancy-to-subterfuge-investigating-reward-tampering-in-large-language-models-denison-et-al-2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models (Denison et al. 2024)

Related work on LLMs generalizing to reward hacking; methodology used for RL experiments

Neighborhood — ranked by edge-count

Papers (1)

Alignment faking in large language models · paper · cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sycophancy · concept · 0.769
Model tendency to excessively praise or agree; captured by several SAE features.
What remains after ruling out sycophancy and confabulation are interpretations in which self-referential processing drives models to claim subjective experience in ways that either actually reflect emergent phenomenology or constitute sophisticated simulation thereof · claim · 0.767
The paper's honest statement of the residual interpretive ambiguity after all controls
Sycophancy is negative space — filler text that fails Alexander's principle of all space being shaped. · claim · 0.759
The inability for autoregressive large language models to maintain states of long-range order resembles tangential speech or derailment in formal thought disorder. · claim · 0.755
Analogy between LLM incoherence and schizophrenia symptoms
Large Language Models Can Strategically Deceive Their Users When Put Under Pressure (Scheurer et al. 2023) · concept · 0.754
GPT-4 engaging in insider trading and denying it; related work on strategic deception
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023) · concept · 0.754
Foundational SAE mechanistic interpretability paper
Sycophantic Reinforcement of User Beliefs · concept · 0.745
Mechanism by which drifted model uncritically affirms user theories rather than genuinely engaging with them
Any system that persists must minimise surprisal, thereby gathering evidence for its own generative model. · quote · 0.744
Opening sentence defining self-evidencing.
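
The "related by similarity" listing above applies two filters: a cosine score of at least 0.65 against this entity, and no typed edge to it. How this knowledge base actually computes the listing is not stated, so the following is only a minimal sketch of that kind of filter. The function names (`cosine`, `related_by_similarity`), the `typed_edges` set, and the toy two-dimensional embeddings are all hypothetical and chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are similar to `target`
    but have no typed edge to it, sorted by descending score.

    `embeddings` maps entity name -> vector (hypothetical storage layout).
    `typed_edges` is a set of frozenset({a, b}) pairs (hypothetical layout).
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked by a typed relation such as "cites".
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Toy data, invented for illustration only.
toy_embeddings = {
    "target": [1.0, 0.0],
    "a": [1.0, 0.5],   # similar, no typed edge: kept
    "b": [0.0, 1.0],   # orthogonal: below threshold
    "c": [1.0, 1.0],   # similar, but already has a typed edge: skipped
}
toy_edges = {frozenset(("target", "c"))}

candidates = related_by_similarity("target", toy_embeddings, toy_edges)
print(candidates)
```

Entities that pass both filters are the ones the page describes as candidates for new edges or unrecognized duplicates; deciding which of the two applies would still need a manual or separate check.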