Internal Consistency Monitoring

The inferred mechanism underlying Endogenous Steering Resistance (ESR), whereby the model tracks the coherence of its own outputs.

Neighborhood — ranked by edge-count

paper

concept

Endogenous Steering Resistance · implements
The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs.
Off-Topic Detector Latents · implements
26 SAE latents identified as differentially activated during off-topic content and causally linked to ESR.
Behavioral Imitation vs. Genuine Self-Monitoring · contradicts
The distinction between learning the surface pattern of self-correction and developing effective monitoring mechanisms.

finding

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Internal uncertainty · concept · 0.756
The model's internal representation of uncertainty hypothesized to trigger self-reflection.
Internality Criterion · concept · 0.747
Criterion requiring that causal influence of internal state on description be internal, not routed through sampled outputs; rules out pseudo-introspection via self-observation.
Does the model internally maintain a form of 'consistency score' or probability mass over coherent reasoning trajectories, and how is this score modulated during reflection? · question · 0.743
Promising future research direction about the internal mechanism of error detection.
Model Internal Belief · concept · 0.742
The latent representational state of a model's answer confidence as decoded from activations, distinct from what appears in generated text.
Epistemic Internalism · framework · 0.731
The view that epistemic justification is fully determined by factors internal to the subject's mind, often linked to consciousness.
Internal Features · concept · 0.730
Representations inside LLMs that can be intervened upon.
Black-box internal state monitoring · concept · 0.729
Monitoring approach not requiring internal model access; applicable to proprietary systems and scales naturally with model size.
Internal model representations · concept · 0.722
The latent activations or embeddings inside a neural network.
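
The candidate list above pairs each entity with a cosine score and keeps only those at or above 0.65 that have no typed edge to this one. How this page actually computes its scores is not stated, so the following is only a minimal sketch of that kind of filter. The function names, the two-dimensional toy vectors, and the `typed_edges` set are all hypothetical, chosen for illustration; real entity embeddings would be much higher-dimensional.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are similar to `target`
    (cosine >= threshold) but have no typed edge to it, highest score first."""
    target_vec = embeddings[target]
    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and anything already linked
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((name, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy 2-D embeddings, invented for illustration only.
embeddings = {
    "Internal Consistency Monitoring": [1.0, 0.0],
    "Endogenous Steering Resistance": [1.0, 0.1],   # already has a typed edge
    "Internal uncertainty": [1.0, 1.0],             # similar, no typed edge
    "Unrelated entity": [0.0, 1.0],                 # orthogonal, below threshold
}
typed_edges = {"Endogenous Steering Resistance"}

candidates = untyped_neighbors(
    "Internal Consistency Monitoring", embeddings, typed_edges
)
print(candidates)
```

With these toy vectors only "Internal uncertainty" survives the filter: the already-linked entity is skipped regardless of its score, and the orthogonal one falls below the threshold. Each surviving pair would then be reviewed by hand as either a missing typed edge or a duplicate.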