concept
active
concept:model-introspection
model introspection
The capacity of a model to self-report on its internal emotional state when its SAE features are steered; used here as a measurement tool.
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (2)
method
- Agentic self-steering evaluation · implements · Method where Kimi K2.5 steers its own SAE features in real time and reports on its internal emotional state
- Method where Kimi evaluates steered vs unsteered text samples from another instance to rate SAE feature emotionality (0-100)
Concepts (1)
concept
- Introspection · related_to · The ability of a model to observe its own past internal states or computations; claimed to be architecturally permitted by transformers.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Introspective capabilities appear only in very large models (>70B), with 70B barely on the threshold; bottleneck for independent research.
- Interpretation of the observation that the most capable models performed best.
- Central open question raised by the paper.
- Introspective capabilities may continue to develop with further improvements to model capabilities · claim · 0.816 · Forward-looking statement about future models.
- Key gap identified in the literature; systematic self-examination processes for machine consciousness development.
- Pearson-Vogel et al.'s finding that models can detect prior concept injections; introspective signals exist in middle layers suppressed by post-training
- The authors' characterization of genuine but limited introspective capability found only in early-layer injection regimes
- Direct introspection into phenomenal consciousness; its correlation with functional introspection is an open question.
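
The "related by similarity" selection above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the function names, the toy embedding vectors, and the entity labels are all hypothetical, and only the 0.65 threshold comes from the listing itself.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_neighbors, threshold=0.65):
    """Return (entity, score) pairs at or above the threshold that have no
    typed edge to the target, sorted by descending similarity."""
    results = []
    for name, vec in embeddings.items():
        # Skip the target itself and anything already linked by a typed edge.
        if name == target or name in typed_neighbors:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: 2-d vectors stand in for real entity embeddings.
toy_embeddings = {
    "model-introspection": [1.0, 0.0],
    "introspection": [1.0, 0.1],          # similar, but already has a typed edge
    "future-capabilities": [1.0, 1.0],    # similar, no typed edge
    "unrelated": [0.0, 1.0],              # orthogonal to the target
}
candidates = similar_without_edge(
    "model-introspection", toy_embeddings, typed_neighbors={"introspection"}
)
```

Entities returned this way are only candidates: a high score suggests a missing edge or a duplicate, and each one still needs review before a typed relation is added.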