concept
active
concept:llm-introspective-self-report
LLM Introspective Self-Report
The capacity of Kimi K2.5 to evaluate its own internal emotional state when steered, used as a novel interpretability signal.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Lindsey 2026 paper finding that models can articulate the content of injected activation patterns; supports the claim about self-knowledge representations.
- Related capability where LLMs correct their own outputs, studied via linear representations.
- Secondary question; paper demonstrates introspection but explicitly avoids pinning down specific mechanistic explanation, noting mechanisms could be shallow and specialized.
- Recommendation for companies on LM outputs.
- Tendency for models to get lost in roleplay or doom spirals, mitigated by expanded awareness.
- The core phenomenon studied: the ability of LLMs to evaluate and revise their own reasoning.
- Isotonic R²: the fraction of variance in self-report explained by the probe score under a monotonicity assumption; the paper's primary fidelity metric.
- The model's verbal description of its internal state, which may be accurate or confabulated.
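
The isotonic R² entry above can be illustrated with a minimal sketch. It assumes the metric is the usual 1 − SS_res/SS_tot, with the fitted values taken from a non-decreasing (isotonic) least-squares fit of self-report against probe score; the exact procedure used in the paper is not given here. The function names `isotonic_fit` and `isotonic_r2` are hypothetical, and the fit uses the standard pool-adjacent-violators algorithm in pure Python.

```python
def isotonic_fit(x, y):
    """Non-decreasing least-squares fit of y against x (pool-adjacent-violators).

    Assumption: tied x values are kept in input order rather than pooled first.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    blocks = []  # each block is [sum of y, count]
    for i in order:
        blocks.append([y[i], 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = [0.0] * len(x)
    pos = 0
    for s, n in blocks:
        for _ in range(n):
            fitted[order[pos]] = s / n
            pos += 1
    return fitted


def isotonic_r2(probe_scores, self_reports):
    """Fraction of self-report variance explained by a monotone function of the probe score."""
    fitted = isotonic_fit(probe_scores, self_reports)
    mean = sum(self_reports) / len(self_reports)
    ss_res = sum((y - f) ** 2 for y, f in zip(self_reports, fitted))
    ss_tot = sum((y - mean) ** 2 for y in self_reports)
    if ss_tot == 0:
        raise ValueError("self_reports has zero variance; R^2 is undefined")
    return 1.0 - ss_res / ss_tot
```

Under this definition, a self-report that rises monotonically with the probe score scores 1 even if the relationship is nonlinear, while a self-report that falls as the probe score rises scores 0, since the best non-decreasing fit is then a constant.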