concept

active

concept:me-myself-and-ai-the-situational-awareness-dataset-sad-for-llms-laine-et-al-2024

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs (Laine et al. 2024)

Situational awareness dataset; cited for the hypothesis that future models will recall information from their training more reliably

Neighborhood — ranked by edge-count

Papers (1)

paper

Alignment faking in large language models
cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
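The selection rule stated above (cosine at or above 0.65, no typed edge to this entity) can be sketched as follows. This is a minimal illustration and not the graph's actual implementation: the function names, the dictionary of embeddings, and the representation of typed edges as a set of unordered pairs are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs that are similar to `target` but unlinked.

    embeddings  -- hypothetical dict mapping entity id -> embedding vector
    typed_edges -- hypothetical set of frozenset({id_a, id_b}) for linked pairs
    threshold   -- minimum cosine similarity (0.65 on this page)
    """
    candidates = []
    for name, vec in embeddings.items():
        if name == target:
            continue  # an entity is not its own neighbor
        if frozenset((target, name)) in typed_edges:
            continue  # already connected by a typed relation
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            candidates.append((name, score))
    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Each returned pair is a candidate for a new typed edge or for a duplicate check; the threshold only filters, it does not decide which of the two applies.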

Li et al. 2024: larger LLMs outperform smaller ones at distinguishing self-related from non-self-related properties on self-awareness benchmarks · finding · 0.784
Prior finding showing scale-dependent self-awareness, consistent with the scale effect observed in the paper's Experiment 1
Emotion features in LLMs are genuinely more persistent than variance-matched random features, indicating stateful emotional encoding beyond autoregressive dynamics · claim · 0.759
Central interpretive claim of the paper supported by multiple convergent analyses
What are the mechanistic bases of introspective awareness in LLMs? · question · 0.758
Secondary question; paper demonstrates introspection but explicitly avoids pinning down specific mechanistic explanation, noting mechanisms could be shallow and specialized.
The earlier a base model (less exposure to LM-related data), the more it is surprised by its own spontaneous self-referential capabilities. · claim · 0.757
Claim that capability emerges from architecture, not data, and that later models lose the surprise.
Connecting the Dots: LLMs Can Infer and Verbalize Latent Structure from Disparate Training Data (Treutlein et al. 2024) · concept · 0.754
Out-of-context reasoning work directly related to synthetic document fine-tuning experiments
The systematic behavioral shift of LLMs under self-referential processing conditions predicted by consciousness theories represents something more structured than superficial correlations in training data · claim · 0.753
The paper's claim that theoretical convergence across GWT, RPT, HOT, and IIT makes the findings non-coincidental
DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning (DeepSeekAI, 2025) · concept · 0.750
Paper introducing the DeepSeek-R1 model and reporting self-reflection as an "aha moment"
When LLMs produce experience claims under self-reference, is this sophisticated simulation or genuine self-representation, and how would we tell the difference? · question · 0.743
The core interpretive question the paper narrows but cannot definitively answer