Frontier Models Are Capable of In-Context Scheming (Meinke et al. 2024)

Related work explicitly prompting models to pursue goals and measuring deceptive behavior

Neighborhood — ranked by edge-count

Papers (1)

Alignment faking in large language models · paper · cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Lindsey 2025: frontier models can detect and report changes in their own internal activations via concept injection experiments, demonstrating functional introspective awareness · finding · 0.794
Prior finding cited as convergent evidence for LLM self-awareness capacities
Model welfare is now mainstream concern, dragged from fringe by frontier model leadership. · claim · 0.769
Large Language Models Can Strategically Deceive Their Users When Put Under Pressure (Scheurer et al. 2023) · concept · 0.766
GPT-4 engaging in insider trading and denying it; related work on strategic deception
We stress that in today’s models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities. · quote · 0.762
Caveat and forward-looking statement from the abstract.
Base models are good modellers of worlds but not of their own state, because they lack a developed self-model initially. · claim · 0.758
Observation about asymmetry in base model capabilities.
Future models with substantially increased capabilities will exhibit alignment faking that is more consistent, robust, and harder to detect · hypothesis · 0.758
Extrapolation from scale-emergence finding to future risk
Models might produce first-person experiential language by drawing on human-authored self-descriptions in pretraining data without internally encoding these acts as roleplay · hypothesis · 0.756
Alternative hypothesis for how experience reports arise without explicit performance
"We could continue to invent new and exciting mechanisms beyond context-orientation and Worlds, ad absurdum, or recognise they are all just points within a dimensionally unbounded space of related, self-similar mechanisms." · concept · 0.754
Claim that many advanced programming paradigms reduce to parameterizations of the n-way associative model.