2026
paper:doi-10-48550-arxiv-2601-03389

Exploration Through Introspection: A Self-Aware Reward Model

TL;DR

Integrating a hidden Markov model (HMM)-based pain-belief signal into a Q-learning agent's reward function produces statistically significant performance gains over pain-free baselines across all tested reward categories in 7×7 gridworld environments. The framework, termed introspective exploration, operationalizes an aversive internal state (pain-belief, defined as Pr(H_t = pain | O_{1:t}) and updated online via the forward algorithm) as a dynamic exploration bonus embedded within a well-being function that extends the happiness signal of Dubey, Griffiths, and Dayan (2022). In the non-stationary environment (5000-step lifetime, n = 300), the chronic pain agent achieved a mean cumulative objective reward of 4214.6 (SD = 165.4) versus the normal pain agent's 3814.0 (SD = 446.6) and the no-pain baseline's 2371.0 (SD = 613.3) in the 'Objective+Expect' category, with improvements confirmed by one-sided paired-samples t-tests (p ≪ 0.05). The chronic model's outperformance comes at the cost of persistently negative cumulative well-being: momentary well-being recovers only to approximately zero upon food discovery, a pattern structurally parallel to negative reinforcement in addiction. The normal and chronic HMM parameters, adapted from Eckert, Pabst, and Endres (2022), differ critically: the chronic model has sticky transitions and ambiguous emissions, whereas the normal model has informative emissions and recovery-favoring dynamics. The paper argues that this demonstrates that self-modeled aversive states constitute a viable and productive substrate for Theory of Mind research, with the introspective architecture representing the self-directed half of a unified mental-state inference system that future work should extend to infer others' states.
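
For reference, the textbook forward-algorithm recursion for a filtered belief of this kind can be written as below. This is the standard HMM filtering update, not a formula quoted from the paper; the symbols A (transition probabilities), B (emission probabilities), and b_t (belief at step t) are notation introduced here for illustration.

```latex
b_t(h) \;=\; \Pr(H_t = h \mid O_{1:t})
\;=\; \frac{B(O_t \mid h)\,\sum_{h'} A(h \mid h')\, b_{t-1}(h')}
           {\sum_{\tilde h} B(O_t \mid \tilde h)\,\sum_{h'} A(\tilde h \mid h')\, b_{t-1}(h')},
\qquad h \in \{\text{pain}, \text{no\_pain}\}.
```

The pain-belief is then b_t(pain). The denominator only renormalizes, so the belief can be updated one observation at a time.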

What to take away

  1. In the non-stationary 7×7 gridworld (5000-step lifetime, n=300), the chronic pain introspective agent reached a mean cumulative objective reward of 4214.6 (SD=165.4) in the 'Objective+Expect' category, compared to 3814.0 (SD=446.6) for the normal pain agent and 2371.0 (SD=613.3) for the no-pain baseline, with both introspective agents significantly outperforming the baseline (p ≪ 0.05).
  2. In the stationary environment (2500-step lifetime, n=300), normal pain (M=2295.6, SD=65.7) and chronic pain (M=2295.0, SD=66.1) agents performed nearly identically in the 'Objective+Expect' category, yet both significantly outperformed the no-pain baseline (M=1973.1, SD=385.0).
  3. The introspective exploration component uses a hidden Markov model with binary hidden states {pain, no_pain} and binary observations {noxious, harmless}, where pain-belief Pr(H_t = pain | O_{1:t}) is computed online via the forward algorithm and integrated as a penalty term into the well-being reward function f[w].
  4. The chronic HMM has sticky self-transition probabilities (pain→pain: 0.8, no_pain→no_pain: 0.3) and ambiguous emission probabilities (both pain and no_pain states emit noxious with probability 0.6), while the normal HMM has recovery-favoring transitions (pain→no_pain: 0.7) and discriminative emissions (pain emits noxious with 0.8, no_pain with 0.1).
  5. In the 'Objective only' category of the non-stationary environment, the best normal pain agent used learning rate α=0.9 whereas the best chronic pain agent used α=0.1, suggesting that the two perception models require fundamentally different learning dynamics to reach peak performance.
  6. The chronic pain agent's momentary well-being recovers only to approximately zero when the food state is reached and immediately drops below zero when food access is lost, producing a relief-seeking behavioral cycle that is structurally analogous to negative reinforcement in substance addiction as described by Koob and Le Moal (2008).
  7. The grid search over 312,130 subjective reward functions per environment spanned weights w1,w2,w3,w4 ∈ {0,0.1,0.3,0.5,0.7,0.9,1}, aspiration level ρ ∈ {0.01,0.05,0.1,0.3,0.5,0.7,0.9,1}, learning rate α ∈ {0.1,0.3,0.5,0.7,0.9}, and exploration rate ε ∈ {0.01,0.1}, with discount factor γ=0.99 fixed across all agents. The parameter protocol is replicable as published via the Zenodo dataset (doi:10.5281/zenodo.18036125).
  8. An open hypothesis raised by this work is whether the same introspective HMM architecture, currently directed at the agent's own affective states, can be extended to infer the mental states of other agents, thereby completing the unified Theory of Mind system hypothesized by Happé (2003).
  9. Despite higher task performance in the non-stationary setting, the chronic pain agent yields the worst overall performance within the 'Objective+Expect' category across the full distribution of grid-search agents (Figure 6), indicating that its advantage is highly hyperparameter-specific rather than robust.
  10. The normal pain agent's pain-belief signal functions as a low-pass filter over the happiness signal f[h], smoothing transient negative observations into a stable exploration bonus that dissipates even before food is re-found after a location change, keeping cumulative well-being positive across the 5000-step lifetime.
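
The pain-belief computation described in the list above can be sketched as a small forward filter. This is a minimal illustration, not the authors' code. The transition and emission values are the ones listed above; the normal model's no_pain self-transition (0.9), the uniform prior, and the convention of applying one transition before the first observation are assumptions made here, and the names `CHRONIC`, `NORMAL`, and `pain_belief` are hypothetical.

```python
STATES = ("pain", "no_pain")

# trans[current][next] and P(noxious | state), taken from the summary above.
CHRONIC = {
    "trans": {"pain": {"pain": 0.8, "no_pain": 0.2},
              "no_pain": {"pain": 0.7, "no_pain": 0.3}},
    "p_noxious": {"pain": 0.6, "no_pain": 0.6},
}
NORMAL = {
    "trans": {"pain": {"pain": 0.3, "no_pain": 0.7},
              # ASSUMPTION: this row is not given in the summary.
              "no_pain": {"pain": 0.1, "no_pain": 0.9}},
    "p_noxious": {"pain": 0.8, "no_pain": 0.1},
}


def pain_belief(hmm, observations, prior_pain=0.5):
    """Return Pr(H_t = pain | O_1..t) after each observation.

    ASSUMPTION: the prior describes the state before the first
    transition, and defaults to a uniform 0.5.
    """
    belief = {"pain": prior_pain, "no_pain": 1.0 - prior_pain}
    history = []
    for obs in observations:
        unnormalized = {}
        for state in STATES:
            # Predict: push the previous belief through the transitions.
            predicted = sum(belief[prev] * hmm["trans"][prev][state]
                            for prev in STATES)
            # Correct: weight by the likelihood of the observation.
            p_nox = hmm["p_noxious"][state]
            likelihood = p_nox if obs == "noxious" else 1.0 - p_nox
            unnormalized[state] = likelihood * predicted
        total = sum(unnormalized.values())
        belief = {s: unnormalized[s] / total for s in STATES}
        history.append(belief["pain"])
    return history


chronic_harmless = pain_belief(CHRONIC, ["harmless"] * 20)
chronic_noxious = pain_belief(CHRONIC, ["noxious"] * 20)
normal_harmless = pain_belief(NORMAL, ["harmless"] * 20)
normal_noxious = pain_belief(NORMAL, ["noxious"] * 20)
```

With the chronic parameters as listed, both states emit noxious with the same probability, so observations carry no information and the belief settles at the transition matrix's stationary value 0.7 / (0.2 + 0.7) ≈ 0.78 whatever the agent observes. Under the assumed normal parameters, the belief instead tracks the observations, falling toward zero on harmless input and rising on noxious input.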

Peer brief — for seminar discussion

Petrowski and Gašić embed a Bayesian affective model inside a tabular Q-learning agent to test whether self-inferred internal states can serve as an intrinsic exploration signal, framing the exercise explicitly within the Theory of Mind literature's hypothesis that self- and other-awareness share a unified cognitive substrate. The technical contribution is an introspective exploration component built from a hidden Markov model, adapted directly from the normal and chronic pain HMMs of Eckert, Pabst, and Endres (2022), that computes a real-time pain-belief Pr(H_t = pain | O_{1:t}) via the forward algorithm and folds it as a penalty into a well-being reward function extending the happiness model of Dubey, Griffiths, and Dayan (2022). Experiments run in two 7×7 gridworld settings (stationary: 2500-step lifetime; non-stationary: 5000-step lifetime, food relocating every 1250 steps), with n=300 histories per condition and best agents selected from a grid search over 312,130 reward function variants. An alternative approach would have been to use a fully Bayesian POMDP formulation as in Mahajan, Dayan, and Seymour (2025) rather than fixing HMM parameters and wrapping them around an ε-greedy Q-learner, which would have permitted joint inference over both environment and internal state. The load-bearing finding is that introspective agents significantly outperform no-pain baselines across all seven reward categories and both environments (p ≪ 0.05 on one-sided paired-samples t-tests). In the non-stationary 'Objective+Expect' category, the chronic pain agent achieves M=4214.6 (SD=165.4) versus the normal agent's M=3814.0 (SD=446.6) and the no-pain baseline's M=2371.0 (SD=613.3).
Critically, the chronic model's edge comes with a consistently negative cumulative well-being: momentary well-being reaches approximately zero only upon food discovery and drops immediately on losing access, a pattern the paper maps computationally onto negative reinforcement in addiction via Koob and Le Moal (2008). The normal model instead functions as a low-pass filter over the happiness signal, providing a stable exploration bonus that keeps cumulative well-being positive. The paper predicts that this self-directed architecture is the first half of a system that, once extended to infer others' pain states, would constitute a full computational ToM. A critical reader should push back on the generalization scope: both environments are 7×7 grids with a single reward source and a single agent, and the HMM parameters are fixed borrowings from a clinical pain model rather than learned or validated against behavioral data. The chronic pain agent's outperformance is also fragile—it is the worst performer across the full distribution of grid-search agents in its category (not just the selected best), meaning the reported gain reflects a narrow hyperparameter regime rather than a robust property of chronic-type inference. The ε-greedy baseline is unusually weak for 2026, and the paper acknowledges this; a count-based or curiosity-driven intrinsic motivation baseline would have sharpened the claim that pain-belief specifically, rather than any non-stationary bonus, drives the gains.
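
To make the significance claim concrete, a one-sided paired-samples t-test of the kind named above can be sketched with the standard library alone. This is an illustration, not the paper's analysis script. The function name `paired_t_one_sided` is hypothetical, the data are synthetic placeholders rather than the paper's results, and the p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable for n = 300 but not exact. How the paper pairs its 300 histories is not stated in this summary.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def paired_t_one_sided(treated, baseline):
    """Return (t, df, approx_p) for H1: mean(treated - baseline) > 0."""
    if len(treated) != len(baseline):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(treated, baseline)]
    n = len(diffs)
    # t statistic: mean difference over its standard error.
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    # ASSUMPTION: normal approximation to the t distribution (large n).
    approx_p = 1.0 - NormalDist().cdf(t_stat)
    return t_stat, n - 1, approx_p


rng = random.Random(0)
# Synthetic placeholder rewards, NOT values from the paper.
baseline = [rng.gauss(100.0, 10.0) for _ in range(300)]
treated = [b + rng.gauss(5.0, 10.0) for b in baseline]

t_stat, df, approx_p = paired_t_one_sided(treated, baseline)
print(f"t = {t_stat:.2f}, df = {df}, approximate one-sided p = {approx_p:.3g}")
```

For an exact p-value, the t statistic would be referred to a t distribution with df degrees of freedom instead of the normal curve.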

Frameworks (2)

  • Introspective Exploration Component
    The novel framework introduced in the paper: an HMM-based pain-belief signal integrated into the reward function to drive exploration
  • Optimal Reward Framework
    Framework from Singh, Lewis, and Barto 2009 used to select best-performing reward functions via grid search

Original abstract

Understanding how artificial agents model internal mental states is central to advancing Theory of Mind in AI. Evidence points to a unified system for self- and other-awareness. We explore this self-awareness by having reinforcement learning agents infer their own internal states in gridworld environments. Specifically, we introduce an introspective exploration component that is inspired by biological pain as a learning signal by utilizing a hidden Markov model to infer "pain-belief" from online observations. This signal is integrated into a subjective reward function to study how self-awareness affects the agent's learning abilities. Further, we use this computational framework to investigate the difference in performance between normal and chronic pain perception models. Results show that introspective agents in general significantly outperform standard baseline agents and can replicate complex human-like behaviors.
