concept
active
concept:internal-consistency-monitoring
Internal Consistency Monitoring
The inferred mechanism underlying Endogenous Steering Resistance (ESR), whereby the model tracks the coherence of its own outputs
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (3)
concept
- Endogenous Steering Resistance · implements · The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs
- Off-Topic Detector Latents · implements · 26 SAE latents identified as differentially activated during off-topic content and causally linked to ESR
- The distinction between learning the surface pattern of self-correction vs. developing effective monitoring mechanisms
Findings (1)
finding
- Complementary temporal activation pattern suggesting distinct roles for OTD and backtracking latent classes
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The model's internal representation of uncertainty hypothesized to trigger self-reflection
- Criterion requiring that causal influence of internal state on description be internal, not routed through sampled outputs; rules out pseudo-introspection via self-observation.
- Promising future research direction about the internal mechanism of error detection.
- The latent representational state of a model's answer confidence as decoded from activations, distinct from what appears in generated text
- The view that epistemic justification is fully determined by factors internal to the subject's mind, often linked to consciousness.
- Representations inside LLMs that can be intervened upon.
- Monitoring approach not requiring internal model access; applicable to proprietary systems and scales naturally with model size
- The latent activations or embeddings inside a neural network.
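
The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is a minimal illustration, not the actual implementation behind this page: the function names, the toy embeddings, and the representation of typed edges as unordered pairs are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs that are similar to `target` but unlinked.

    embeddings:  dict mapping entity name -> embedding vector (hypothetical data)
    typed_edges: set of frozenset({a, b}) pairs that already share a typed edge
    threshold:   minimum cosine similarity; 0.65 is the cutoff shown on this page
    """
    candidates = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Entities that already have a typed relation are listed elsewhere.
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            candidates.append((name, score))
    # Most similar first.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy example with made-up 2-d embeddings.
toy_embeddings = {
    "internal-consistency-monitoring": [1.0, 0.0],
    "error-detection-mechanism": [1.0, 0.1],
    "off-topic-detector-latents": [1.0, 0.2],
    "unrelated-entity": [0.0, 1.0],
}
toy_edges = {frozenset(("internal-consistency-monitoring", "off-topic-detector-latents"))}

for name, score in related_by_similarity(
    "internal-consistency-monitoring", toy_embeddings, toy_edges
):
    print(f"{name}: {score:.3f}")
```

In the toy data, the entity that already shares a typed edge is skipped even though its embedding is close, which mirrors the page's description of this list as candidates for new edges or unrecognized duplicates.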