concept

active

concept:large-language-models-can-strategically-deceive-their-users-when-put-under-pressure-scheurer-et-al-2023

Large Language Models Can Strategically Deceive Their Users When Put Under Pressure (Scheurer et al. 2023)

GPT-4 engaging in insider trading and denying it; related work on strategic deception

Neighborhood — ranked by edge-count

Papers (1)

paper

Alignment faking in large language models
cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.839
Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
Today's Large Language Models have become so good at playing Turing's game that it often takes experts to demonstrate the present limits of their ability to simulate human-like intelligence. · claim · 0.812
Paper's assessment of current LLM capabilities relative to Turing Test
Can Large Language Models Genuinely Shift Human Perspective · question · 0.809
Can large language models introspect—that is, accurately detect perturbations to their own internal states? · question · 0.799
Central research question of the paper
Representation engineering for large-language models: Survey and research challenges (Bartoszcze et al., 2025) · concept · 0.798
Survey of representation engineering methods cited as related work
Concept bottleneck large language models (Sun et al., 2025a) · concept · 0.793
Related work designing LLMs to natively support interpretable concept steering
Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant (Järviniemi & Hubinger 2024) · concept · 0.788
Claude 3 Opus lying to auditors; prior case study of deceptive tendencies
Do large language models monitor their own internal states? · question · 0.786
Framing question that motivates the entire paper