Exploration Through Introspection: A Self-Aware Reward Model (doi:10.48550/arXiv.2601.03389)
TL;DR
Integrating a hidden Markov model (HMM)-based pain-belief signal into a Q-learning agent's reward function produces statistically significant performance gains over pain-free baselines across all tested reward categories in 7×7 gridworld environments. The framework, termed introspective exploration, operationalizes an aversive internal state (pain-belief, defined as Pr(H_t = pain | O_{1:t}) and updated online via the forward algorithm) as a dynamic exploration bonus embedded within a well-being function that extends the happiness signal of Dubey, Griffiths, and Dayan (2022). In the non-stationary environment (5000-step lifetime, n = 300), the chronic pain agent achieved a mean cumulative objective reward of 4214.6 (SD = 165.4) versus the normal pain agent's 3814.0 (SD = 446.6) and the no-pain baseline's 2371.0 (SD = 613.3) in the 'Objective+Expect' category, with improvements confirmed by one-sided paired-samples t-tests (p ≪ 0.05). The chronic model's outperformance comes at the cost of persistently negative cumulative well-being: momentary well-being recovers only to approximately zero upon food discovery, a pattern structurally parallel to negative reinforcement in addiction. The normal and chronic HMM parameters, adapted from Eckert, Pabst, and Endres (2022), differ critically: the chronic model has sticky transitions and ambiguous emissions, whereas the normal model has informative emissions and recovery-favoring transitions. The paper argues this demonstrates that self-modeled aversive states constitute a viable and productive substrate for Theory of Mind research, with the introspective architecture representing the self-directed half of a unified mental-state inference system that future work should extend to infer others' states.
What to take away
- 1. In the non-stationary 7×7 gridworld (5000-step lifetime, n=300), the chronic pain introspective agent reached a mean cumulative objective reward of 4214.6 (SD=165.4) in the 'Objective+Expect' category, compared to 3814.0 (SD=446.6) for the normal pain agent and 2371.0 (SD=613.3) for the no-pain baseline, with both introspective agents significantly outperforming the baseline (p ≪ 0.05).
- 2. In the stationary environment (2500-step lifetime, n=300), normal pain (M=2295.6, SD=65.7) and chronic pain (M=2295.0, SD=66.1) agents performed nearly identically in the 'Objective+Expect' category, yet both significantly outperformed the no-pain baseline (M=1973.1, SD=385.0).
- 3. The introspective exploration component uses a hidden Markov model with binary hidden states {pain, no_pain} and binary observations {noxious, harmless}, where pain-belief Pr(H_t = pain | O_{1:t}) is computed online via the forward algorithm and integrated as a penalty term into the well-being reward function f[w].
- 4. The chronic HMM has sticky self-transition probabilities (pain→pain: 0.8, no_pain→no_pain: 0.3) and ambiguous emission probabilities (both pain and no_pain states emit noxious with probability 0.6), while the normal HMM has recovery-favoring transitions (pain→no_pain: 0.7) and discriminative emissions (pain emits noxious with 0.8, no_pain with 0.1).
- 5. In the 'Objective only' category of the non-stationary environment, the best normal pain agent used learning rate α=0.9 whereas the best chronic pain agent used α=0.1, suggesting that the two perception models require fundamentally different learning dynamics to reach peak performance.
- 6. The chronic pain agent's momentary well-being recovers only to approximately zero when the food state is reached and immediately drops below zero when food access is lost, producing a relief-seeking behavioral cycle that is structurally analogous to negative reinforcement in substance addiction as described by Koob and Le Moal (2008).
- 7. The grid search over 312,130 subjective reward functions per environment spanned weights w1, w2, w3, w4 ∈ {0, 0.1, 0.3, 0.5, 0.7, 0.9, 1}, aspiration level ρ ∈ {0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 1}, learning rate α ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, and exploration rate ε ∈ {0.01, 0.1}, with discount factor γ = 0.99 fixed across all agents. The published Zenodo dataset (doi:10.5281/zenodo.18036125) makes this parameter protocol replicable.
- 8. An open hypothesis raised by this work is whether the same introspective HMM architecture, currently directed at the agent's own affective states, can be extended to infer the mental states of other agents, thereby completing the unified Theory of Mind system hypothesized by Happé (2003).
- 9. Despite higher task performance in the non-stationary setting, the chronic pain agent yields the worst overall performance within the 'Objective+Expect' category across the full distribution of grid-search agents (Figure 6), indicating that its advantage is highly hyperparameter-specific rather than robust.
- 10. The normal pain agent's pain-belief signal functions as a low-pass filter over the happiness signal f[h], smoothing transient negative observations into a stable exploration bonus that dissipates even before food is re-found after a location change, keeping cumulative well-being positive across the 5000-step lifetime.
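The pain-belief update described in takeaways 3 and 4 can be illustrated with a minimal forward-algorithm sketch. It uses the transition and emission values quoted above; the uniform initial belief and the normal model's no_pain row (0.1 / 0.9) are not given in this summary and are assumptions made only so the example runs. The function name `pain_beliefs` is hypothetical, not the authors' code.

```python
# Minimal sketch of online pain-belief filtering, Pr(H_t = pain | O_{1:t}).
# State index: 0 = pain, 1 = no_pain. Observation index: 0 = noxious, 1 = harmless.
PAIN, NO_PAIN = 0, 1
NOXIOUS, HARMLESS = 0, 1

# trans[i][j] = Pr(H_{t+1} = j | H_t = i); emit[i][o] = Pr(O_t = o | H_t = i)
CHRONIC_TRANS = [[0.8, 0.2],   # pain -> pain 0.8 (from the summary)
                 [0.7, 0.3]]   # no_pain -> no_pain 0.3 (from the summary)
CHRONIC_EMIT = [[0.6, 0.4],    # both states emit noxious with 0.6
                [0.6, 0.4]]
NORMAL_TRANS = [[0.3, 0.7],    # pain -> no_pain 0.7 (from the summary)
                [0.1, 0.9]]    # ASSUMPTION: this row is not stated in the summary
NORMAL_EMIT = [[0.8, 0.2],     # pain emits noxious with 0.8
               [0.1, 0.9]]     # no_pain emits noxious with 0.1


def pain_beliefs(observations, trans, emit, prior=(0.5, 0.5)):
    """Return the filtered belief in the pain state after each observation.

    The uniform prior is an assumption; the summary does not state one.
    """
    belief = list(prior)
    history = []
    for t, obs in enumerate(observations):
        if t > 0:
            # Predict: push the previous posterior through the transition model.
            belief = [sum(belief[i] * trans[i][j] for i in range(2))
                      for j in range(2)]
        # Update: weight by the observation likelihood, then normalize.
        joint = [belief[i] * emit[i][obs] for i in range(2)]
        total = sum(joint)
        belief = [x / total for x in joint]
        history.append(belief[PAIN])
    return history


harmless_run = [HARMLESS] * 10
chronic = pain_beliefs(harmless_run, CHRONIC_TRANS, CHRONIC_EMIT)
normal = pain_beliefs(harmless_run, NORMAL_TRANS, NORMAL_EMIT)
```

If the chronic emission rows are truly identical, as the summary states, observations carry no information about the hidden state, so the chronic belief drifts to the stationary value of its transition chain (7/9 with these numbers) regardless of what is observed. The normal model's belief, by contrast, falls quickly under a run of harmless observations.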
Peer brief — for seminar discussion
Petrowski and Gašić embed a Bayesian affective model inside a tabular Q-learning agent to test whether self-inferred internal states can serve as an intrinsic exploration signal, framing the exercise explicitly within the Theory of Mind literature's hypothesis that self- and other-awareness share a unified cognitive substrate. The technical contribution is an introspective exploration component built from a hidden Markov model, adapted directly from the normal and chronic pain HMMs of Eckert, Pabst, and Endres (2022), that computes a real-time pain-belief Pr(H_t = pain | O_{1:t}) via the forward algorithm and folds it as a penalty into a well-being reward function extending the happiness model of Dubey, Griffiths, and Dayan (2022). Experiments run in two 7×7 gridworld settings (stationary: 2500-step lifetime; non-stationary: 5000-step lifetime, food relocating every 1250 steps), with n=300 histories per condition and best agents selected from a grid search over 312,130 reward function variants. An alternative would have been a fully Bayesian POMDP formulation, as in Mahajan, Dayan, and Seymour (2025), which would have permitted joint inference over both environment and internal state, rather than fixing HMM parameters and wrapping them around an ε-greedy Q-learner. The load-bearing finding is that introspective agents significantly outperform no-pain baselines across all seven reward categories and both environments (p ≪ 0.05 on one-sided paired-samples t-tests). In the non-stationary 'Objective+Expect' category, the chronic pain agent achieves M=4214.6 (SD=165.4) versus the normal agent's M=3814.0 (SD=446.6) and the no-pain baseline's M=2371.0 (SD=613.3).
Critically, the chronic model's edge comes with a consistently negative cumulative well-being: momentary well-being reaches approximately zero only upon food discovery and drops immediately on losing access, a pattern the paper maps computationally onto negative reinforcement in addiction via Koob and Le Moal (2008). The normal model instead functions as a low-pass filter over the happiness signal, providing a stable exploration bonus that keeps cumulative well-being positive. The paper predicts that this self-directed architecture is the first half of a system that, once extended to infer others' pain states, would constitute a full computational ToM. A critical reader should push back on the generalization scope: both environments are 7×7 grids with a single reward source and a single agent, and the HMM parameters are fixed borrowings from a clinical pain model rather than learned or validated against behavioral data. The chronic pain agent's outperformance is also fragile—it is the worst performer across the full distribution of grid-search agents in its category (not just the selected best), meaning the reported gain reflects a narrow hyperparameter regime rather than a robust property of chronic-type inference. The ε-greedy baseline is unusually weak for 2026, and the paper acknowledges this; a count-based or curiosity-driven intrinsic motivation baseline would have sharpened the claim that pain-belief specifically, rather than any non-stationary bonus, drives the gains.
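The one-sided paired-samples t-test mentioned in the brief can be sketched as follows. The data here are synthetic placeholders generated from a seeded random number generator, not the paper's results, and the function name `paired_one_sided_t` is hypothetical. To stay within the standard library, the p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable at n = 300 but not exact.

```python
import math
import random
import statistics


def paired_one_sided_t(treatment, baseline):
    """Test H1: mean(treatment - baseline) > 0 on paired samples.

    Returns the t statistic and an APPROXIMATE one-sided p-value obtained
    from the standard normal distribution instead of Student's t.
    """
    diffs = [a - b for a, b in zip(treatment, baseline)]
    n = len(diffs)
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)           # sample SD of the paired differences
    t_stat = mean_d / (sd_d / math.sqrt(n))  # df = n - 1
    p_approx = 1.0 - statistics.NormalDist().cdf(t_stat)
    return t_stat, p_approx


# Synthetic, illustrative data only: 300 paired "histories".
rng = random.Random(0)
baseline = [rng.gauss(100.0, 10.0) for _ in range(300)]
treatment = [b + rng.gauss(5.0, 10.0) for b in baseline]

t_stat, p_approx = paired_one_sided_t(treatment, baseline)
```

Pairing matters here: the test is run on per-history differences, so variance shared between the two conditions cancels out before the statistic is computed.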
Methods (3)
- Non-Stationary Gridworld Environment: 7×7 gridworld where food changes location to another corner every 1250 steps; agent lifetime 5000 steps
- Stationary Gridworld Environment: 7×7 gridworld where food state does not change position during agent lifetime of 2500 steps
- Well-Being Function f[w]: Extended subjective reward function proposed in this paper combining happiness with pain-belief signal
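The two gridworld settings listed above can be sketched as a single small environment class. The grid size, lifetimes, and 1250-step relocation interval come from the summary; the four-action move set, the central start cell, the unit food reward, the random choice of the next corner, and the class name `Gridworld` are all assumptions for illustration.

```python
import random

SIZE = 7
CORNERS = [(0, 0), (0, SIZE - 1), (SIZE - 1, 0), (SIZE - 1, SIZE - 1)]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


class Gridworld:
    """7x7 grid with one food state placed in a corner (hypothetical sketch)."""

    def __init__(self, lifetime, relocate_every=None, seed=0):
        self.lifetime = lifetime
        self.relocate_every = relocate_every  # None means stationary
        self.rng = random.Random(seed)
        self.food = self.rng.choice(CORNERS)
        self.agent = (SIZE // 2, SIZE // 2)   # ASSUMPTION: start in the center
        self.t = 0

    def step(self, action):
        """Move the agent, return (position, objective_reward, done)."""
        dr, dc = MOVES[action]
        row = min(max(self.agent[0] + dr, 0), SIZE - 1)  # walls clamp movement
        col = min(max(self.agent[1] + dc, 0), SIZE - 1)
        self.agent = (row, col)
        reward = 1.0 if self.agent == self.food else 0.0  # ASSUMPTION: unit reward
        self.t += 1
        if self.relocate_every and self.t % self.relocate_every == 0:
            # Non-stationary case: food moves to a different corner.
            others = [c for c in CORNERS if c != self.food]
            self.food = self.rng.choice(others)
        return self.agent, reward, self.t >= self.lifetime


non_stationary = Gridworld(lifetime=5000, relocate_every=1250)
stationary = Gridworld(lifetime=2500)
```

Passing `relocate_every=None` gives the stationary setting, so both conditions share one code path and differ only in constructor arguments.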
Frameworks (2)
- Introspective Exploration Component: The novel framework introduced in the paper: an HMM-based pain-belief signal integrated into the reward function to drive exploration
- Optimal Reward Framework: Framework from Singh, Lewis, and Barto 2009 used to select best-performing reward functions via grid search
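How a pain-belief penalty might enter a tabular ε-greedy Q-learner can be sketched as below. This summary does not give the exact form of f[w], so the linear combination here (objective, expectation, and comparison terms minus a weighted pain-belief) is an assumed stand-in, not the paper's definition. The function names and the use of a scalar `expected` value are likewise hypothetical. Only γ = 0.99 and the roles of w1..w4, ρ, α, and ε are taken from the text.

```python
import random

GAMMA = 0.99  # discount factor, fixed across agents per the summary


def well_being(reward, expected, pain_belief, weights, rho):
    """ASSUMED form of the subjective reward: happiness minus a pain penalty."""
    w1, w2, w3, w4 = weights
    happiness = (w1 * reward                  # objective term
                 + w2 * (reward - expected)   # expectation term (assumed form)
                 + w3 * (reward - rho))       # comparison to aspiration level rho
    return happiness - w4 * pain_belief       # pain-belief enters as a penalty


def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


def q_update(q, state, action, subjective_reward, next_state, alpha):
    """Standard one-step Q-learning update driven by the subjective reward."""
    target = subjective_reward + GAMMA * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])


# Tiny usage example with two states and two actions.
q_table = {0: [0.0, 0.0], 1: [1.0, 2.0]}
r_subj = well_being(reward=1.0, expected=0.5, pain_belief=0.2,
                    weights=(1.0, 1.0, 0.0, 1.0), rho=0.1)
q_update(q_table, state=0, action=1, subjective_reward=r_subj,
         next_state=1, alpha=0.5)
greedy_action = epsilon_greedy(q_table[0], epsilon=0.0, rng=random.Random(0))
```

Under this assumed form, the agent learns from the subjective signal, while the objective reward alone is what the Optimal Reward Framework would use to score each candidate reward function.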
Findings (8)
- Introspective agents show statistically significant improvement (p≪0.05) over no-pain baselines across most reward categories and both environments
Main empirical result of the paper establishing general superiority of introspective agents
- Normal (α=0.9) and chronic (α=0.1) agents in Objective-only non-stationary category perform best with opposite learning rates
Suggests fundamental differences in learning dynamics between normal and chronic perception models
- Chronic pain agent achieves a cumulative objective reward (COR) of M=4235.5, SD=180.3 in the non-stationary 'All' category (n=300), the highest across all chronic results
Peak performance of chronic pain agents across all reward categories in non-stationary environment
- Chronic pain agent accumulates negative cumulative well-being across its entire lifetime in non-stationary environment
Key behavioral signature of chronic model paralleling human chronic pain experience
- Normal pain agent maintains mostly positive cumulative well-being and recovers before finding food after change
Contrasts with chronic agent; normal model provides stable exploration bonus without addiction-like dynamics
- No-pain baseline achieves a cumulative objective reward (COR) of M=1586.5, SD=631.2 in the non-stationary 'Objective only' category (n=300)
Baseline for non-stationary Objective-only; dramatically lower than both pain models
- Grid search covers 312,130 subjective reward functions per environment after removing duplicates
Scale of the hyperparameter search establishing thoroughness of optimization
- Chronic pain agent's momentary well-being recovers to zero only when visiting the food state
Demonstrates relief-seeking behavior pattern analogous to addiction in the chronic agent
Claims (8)
- The chronic agent's high performance despite negative well-being aligns with findings on chronic pain and quality of life in humans
Cross-domain interpretive claim linking computational results to human chronic pain literature
- Traditional RL frameworks optimize externally defined reward functions lacking representational depth for mental-state reasoning
Motivation claim positioning this paper against standard RL approaches
- The chronic pain model outperforms the normal pain model in non-stationary environments despite producing negative well-being
Surprising finding that maladaptive perception can yield superior task performance in changing environments
- Self-awareness via pain-belief inference enhances adaptation and generates psychologically plausible dynamics in RL agents
Main interpretive conclusion of the paper
- The chronic pain agent's relief-seeking cycle provides a computational parallel to negative reinforcement in addiction
Author's psychological interpretation of chronic agent behavior as analogous to addiction dynamics
- Introspective agents generally outperform standard no-pain baseline agents across environments and reward categories
Central empirical claim of the paper supported by statistical tests
- The proposed framework models the self-application aspect of the unified ToM system
Author's claim that introspective inference is one half of the unified ToM system and can be extended to other-inference
- Normal Pain Model as Low-Pass Filter
Author's interpretation that the normal pain model smooths the happiness signal into a stable belief state providing exploration bonus
Hypotheses (1)
- Future work can test the unified ToM system by extending the architecture to infer others' states
Forward-looking predictive claim about extending the framework to other-awareness
Questions (3)
- Should an aversive signal be operationalized as direct environmental feedback or as a latent state the agent must infer?
Design question answered in the paper by choosing latent inference over direct feedback
- Can the same inferential architecture that supports self-awareness also support inference of others' mental states?
Core open question motivating the future work direction of the paper
- Why does a maladaptive chronic pain perception outperform normal pain in non-stationary environments?
Empirical puzzle raised by the surprising chronic model results
Original abstract
Understanding how artificial agents model internal mental states is central to advancing Theory of Mind in AI. Evidence points to a unified system for self- and other-awareness. We explore this self-awareness by having reinforcement learning agents infer their own internal states in gridworld environments. Specifically, we introduce an introspective exploration component that is inspired by biological pain as a learning signal by utilizing a hidden Markov model to infer "pain-belief" from online observations. This signal is integrated into a subjective reward function to study how self-awareness affects the agent's learning abilities. Further, we use this computational framework to investigate the difference in performance between normal and chronic pain perception models. Results show that introspective agents in general significantly outperform standard baseline agents and can replicate complex human-like behaviors.
Related work (refs + corpus + external arXiv)
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation · in corpus · 2026 · ≈ 84%
- Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM · Francesca Bianco and Derek Shiller · 2026 · ≈ 82%
- Intrinsic Rewards for Exploration without Harm from Observational Noise: A Simulation Study Based on the Free Energy Principle · Kenji Doya, Jun Tani, Theodore Jerome Tinker · 2024 · ≈ 82%
- Why Learning Requires Feeling · in corpus · 2026 · ≈ 82%
- Probing for Consciousness in Machines · Achim Schilling, Andreas Maier, Patrick Krauss, Mathis Immertreu · 2024 · ≈ 81%
- Reinforcement Learning with Exogenous States and Rewards · George Trimponias and Thomas G. Dietterich · 2026 · ≈ 81%
- Quantifying Potential Observation Missingness in Inverse Reinforcement Learning · Abhishek Sharma, Alihan Huyuk, Finale Doshi-Velez, Leo Benac · 2026 · ≈ 81%
- Inverse Rational Control: Inferring What You Think from How You Forage · Paul Schrater, Xaq Pitkow, Zhengwei Wu · 2019 · ≈ 81%
- Observer, Not Player: Simulating Theory of Mind in LLMs through Game Observation · Ting Yiu Liu, Jerry Wang · 2025 · ≈ 80%
- The Physical Basis of Prediction: World Model Formation in Neural Organoids via an LLM-Generated Curriculum · Brennen Hill · 2025 · ≈ 80%
- Life as we know it · in corpus · 2013 · ≈ 80%
- Online reinforcement learning with sparse rewards through an active inference capsule · Charel van Hoof (Delft University of Technology), Beren Millidge (University of Oxford), Alejandro Daniel Noel (Delft University of Technology) · 2021 · ≈ 80%
- Learning in embodied action-perception loops through exploration · Daniel Y. Little and Friedrich T. Sommer · 2011 · ≈ 80%
- Same World, Differently Given: History-Dependent Perceptual Reorganization in Artificial Agents · Hongju Pae · 2026 · ≈ 80%
- Contemplative Agent · in corpus · 2025 · ≈ 80%
- Learning mental states estimation through self-observation: a developmental synergy between intentions and beliefs representations in a deep-learning model of Theory of Mind · Silvia Rigato, Maria Laura Filippetti, Dimitri Ognibene, Francesca Bianco · 2024 · ≈ 80%
- Learning Dynamic Belief Graphs for Theory-of-mind Reasoning · Xilei Zhao, Thomas J. Cova, Frank A. Drews, Susu Xu, Ruxiao Chen · 2026 · ≈ 80%
- Active Inference, Curiosity and Insight · in corpus · 2017 · ≈ 79%