claim
active
claim:llm-personality-self-reports-are-illusory-post-training-alignment-creates-stable-human-like-reports-dissociated-from-actual-behavior-han-et-al-2025
LLM personality self-reports are illusory: post-training alignment creates stable human-like reports dissociated from actual behavior (Han et al. 2025)
Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
Source paper
extracted_from · (2026) · Nicolas Martorell · Bruno Bianchi
Neighborhood — ranked by edge-count
Thinkers (1)
thinker
- Pengrui Han · introduces · Showed LLM personality self-reports are illusory; key skeptical prior work motivating the validation approach
Claims (1)
claim
- Central practical conclusion; both methods partially track the same latent state but with different failure modes
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Out-of-context reasoning work directly related to synthetic document fine-tuning experiments
- Establishes that the observed linear structure is not merely a representation of text probability
- The paper's claim that theoretical convergence across GWT, RPT, HOT, IIT makes the findings non-coincidental
- Central empirical conclusion of the paper about the fundamental limits of truth directions.
- The core interpretive question the paper narrows but cannot definitively answer
- Prior finding showing scale-dependent self-awareness, consistent with the scale effect observed in the paper's Experiment 1
- Central interpretive claim of the paper
- Recommendation for companies on LM outputs.