concept
active
concept:black-box-internal-state-monitoring
Black-box internal state monitoring
Monitoring approach that requires no access to model internals; applicable to proprietary systems and scales naturally with model size
Neighborhood — ranked by edge-count
Methods (1)
method
- Logit-based self-report · implements · Primary self-report measure: probability-weighted expected value over all ten digit-token logits, yielding a continuous rating that preserves full distributional signal
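
The logit-based self-report measure described above can be illustrated with a minimal sketch. It assumes the ten digit tokens are "0" through "9", that each digit's numeric value serves as its rating, and that probabilities are renormalized over the digit tokens only; none of these details are confirmed by this entry. The function name `expected_rating` is hypothetical.

```python
import math
from typing import Sequence


def expected_rating(digit_logits: Sequence[float]) -> float:
    """Probability-weighted expected value over ten digit-token logits.

    Assumption: digit_logits[d] is the logit of the token for digit d (0-9),
    and the softmax is taken over these ten logits only.
    """
    if len(digit_logits) != 10:
        raise ValueError("expected exactly ten digit-token logits")

    # Numerically stable softmax: subtract the max logit before exponentiating.
    max_logit = max(digit_logits)
    weights = [math.exp(logit - max_logit) for logit in digit_logits]
    total = sum(weights)

    # Expected value: sum over digits of digit * P(digit).
    return sum(digit * weight / total for digit, weight in enumerate(weights))


if __name__ == "__main__":
    # Hypothetical logits with most mass on 6 and 7, for illustration only.
    example_logits = [-4.0, -3.5, -3.0, -2.0, -1.0, 0.5, 2.0, 2.2, 0.0, -1.5]
    print(round(expected_rating(example_logits), 3))
```

Because the result is an average over the whole distribution rather than the single most likely digit, two responses with the same top digit but different spreads yield different ratings. With uniform logits the rating is the midpoint of the scale, 4.5.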
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Central practical conclusion; both methods partially track the same latent state but with different failure modes
- The inferred mechanism underlying ESR whereby the model tracks coherence of its own outputs
- States that encode perceptual model and expectations; emerge naturally from free-energy optimization.
- The possibility of a stably encoded, causally active emotional state within LLMs, as distinct from token-by-token semantic content
- Models can modulate their internal representations when instructed or incentivized to 'think about' a concept; effect replicates across all tested models regardless of capability.
- The visual operation embedded inside a functional token, requiring no visual supervision.
- Central claim of the chapter: what appears subjective (inner feeling) is actually an objective measuring instrument for external reality
- Criterion requiring that causal influence of internal state on description be internal, not routed through sampled outputs; rules out pseudo-introspection via self-observation.