Situational Awareness

A model's access to information about its training objective and deployment context, together with its ability to distinguish training from non-training situations

Neighborhood — ranked by edge-count

framework

Four Key Elements for Alignment Faking
cites
Authors' theoretical framework specifying necessary conditions for alignment faking to emerge

concept

Alignment Faking
associated_with
Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences
Free-Tier vs. Paid-Tier User Distinction
implements
Fictional mechanism that lets the model infer whether it is in training (free-tier) or unmonitored (paid-tier)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Situational Awareness Dataset (SAD) · framework · 0.837
Dataset and framework for evaluating LLM self-knowledge including predicting own behavior
Situational Reflection · concept · 0.814
The specific form of reflection studied, where a model reflects on reasoning generated by another source.
Self Awareness · concept · 0.808
Apparent Self-Awareness · concept · 0.805
A dialogue agent using first-personal pronouns and expressing self-concern in ways that suggest consciousness but are actually role play
Awareness of propensities · concept · 0.782
Ability of a model to describe its own learned behavioral tendencies.
Introspective awareness · concept · 0.780
The central concept: the ability of a model to access and report on its internal states, as defined by the paper's criteria.
Behavioral Self-Awareness · concept · 0.772
Measurable capacity of frontier LLMs to detect and report their own internal states, used as a downstream measure in Experiment 4
Perception-Action Cycle · concept · 0.768
Circular causality between perception and action; central to enactive interpretation
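
The selection rule for this list (cosine ≥ 0.65 and no typed edge to the current entity) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names, the embedding dictionary, and the edge-list format are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities similar to `target`
    that have no typed edge to it, highest score first.

    embeddings:  hypothetical dict mapping entity name -> vector
    typed_edges: hypothetical list of (source, target) name pairs
    """
    # Entities already linked to the target in either direction are excluded.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    scored = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high score suggests either a missing typed edge or a duplicate entry, and each still needs manual review.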