Exploration Through Introspection: A Self-Aware Reward Model

By Michael Petrowski · Milica Gašić, Heinrich Heine University Düsseldorf

DOI 10.48550/arxiv.2601.03389 · arXiv 2601.03389 · OpenAlex W7119475180

Chronic Pain HMM Parameters · Introspective Exploration Component · Non-Stationary Gridworld Environment · Mahajan, Dayan, and Seymour 2025 · Optimal Reward Framework · Stationary Gridworld Environment · Normal Pain HMM Parameters · Well-Being Function f[w] · Self Awareness · Seven Reward Function Groups · Theory Of Mind

TL;DR

Integrating a hidden Markov model (HMM)-based pain-belief signal into a Q-learning agent's reward function produces statistically significant performance gains over pain-free baselines across all tested reward categories in 7×7 gridworld environments. The framework, termed introspective exploration, operationalizes an aversive internal state (pain-belief, defined as Pr(Ht = pain | O1:t) and updated online via the forward algorithm) as a dynamic exploration bonus embedded within a well-being function that extends the happiness signal of Dubey, Griffiths, and Dayan (2022). In the non-stationary environment (5000-step lifetime, n = 300), the chronic pain agent achieved a mean cumulative objective reward of 4214.6 (SD = 165.4) versus the normal pain agent's 3814.0 (SD = 446.6) and the no-pain baseline's 2371.0 (SD = 613.3) in the 'Objective+Expect' category, with improvements confirmed by one-sided paired-samples t-tests (p ≪ 0.05). The chronic model's outperformance comes at the cost of persistently negative cumulative well-being, with momentary well-being recovering only to approximately zero upon food discovery, a pattern structurally parallel to negative reinforcement in addiction. The normal and chronic HMM parameters, adapted from Eckert, Pabst, and Endres (2022), differ critically: the chronic case has sticky transitions and ambiguous emissions, whereas the normal case has informative emissions and recovery-favoring transitions. The paper argues that this demonstrates that self-modeled aversive states constitute a viable and productive substrate for Theory of Mind research, with the introspective architecture representing the self-directed half of a unified mental-state inference system that future work should extend to infer others' states.

What to take away

1. In the non-stationary 7×7 gridworld (5000-step lifetime, n=300), the chronic pain introspective agent reached a mean cumulative objective reward of 4214.6 (SD=165.4) in the 'Objective+Expect' category, compared to 3814.0 (SD=446.6) for the normal pain agent and 2371.0 (SD=613.3) for the no-pain baseline, with both introspective agents significantly outperforming the baseline (p ≪ 0.05).
2. In the stationary environment (2500-step lifetime, n=300), normal pain (M=2295.6, SD=65.7) and chronic pain (M=2295.0, SD=66.1) agents performed nearly identically in the 'Objective+Expect' category, yet both significantly outperformed the no-pain baseline (M=1973.1, SD=385.0).
3. The introspective exploration component uses a hidden Markov model with binary hidden states {pain, no_pain} and binary observations {noxious, harmless}, where pain-belief Pr(Ht=pain | O1:t) is computed online via the forward algorithm and integrated as a penalty term into the well-being reward function f[w].
4. The chronic HMM has a sticky pain state (pain→pain: 0.8), a short-lived no-pain state (no_pain→no_pain: 0.3), and ambiguous emission probabilities (both pain and no_pain states emit noxious with probability 0.6), while the normal HMM has recovery-favoring transitions (pain→no_pain: 0.7) and discriminative emissions (pain emits noxious with probability 0.8, no_pain with probability 0.1).
5. In the 'Objective only' category of the non-stationary environment, the best normal pain agent used learning rate α=0.9 whereas the best chronic pain agent used α=0.1, suggesting that the two perception models require fundamentally different learning dynamics to reach peak performance.
6. The chronic pain agent's momentary well-being recovers only to approximately zero when the food state is reached and immediately drops below zero when food access is lost, producing a relief-seeking behavioral cycle that is structurally analogous to negative reinforcement in substance addiction as described by Koob and Le Moal (2008).
7. The grid search over 312,130 subjective reward functions per environment spanned weights w1,w2,w3,w4 ∈ {0,0.1,0.3,0.5,0.7,0.9,1}, aspiration level ρ ∈ {0.01,0.05,0.1,0.3,0.5,0.7,0.9,1}, learning rate α ∈ {0.1,0.3,0.5,0.7,0.9}, and exploration rate ε ∈ {0.01,0.1}, with discount factor γ=0.99 fixed across all agents. The protocol can be replicated from the published Zenodo dataset (doi:10.5281/zenodo.18036125).
8. An open hypothesis raised by this work is whether the same introspective HMM architecture, currently directed at the agent's own affective states, can be extended to infer the mental states of other agents, thereby completing the unified Theory of Mind system hypothesized by Happé (2003).
9. Despite higher task performance in the non-stationary setting, the chronic pain agent yields the worst overall performance within the 'Objective+Expect' category across the full distribution of grid-search agents (Figure 6), indicating that its advantage is highly hyperparameter-specific rather than robust.
10. The normal pain agent's pain-belief signal functions as a low-pass filter over the happiness signal f[h], smoothing transient negative observations into a stable exploration bonus that dissipates even before food is re-found after a location change, keeping cumulative well-being positive across the 5000-step lifetime.
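
The pain-belief computation described in takeaways 3 and 4 can be illustrated with a minimal forward-filter sketch. It is not the authors' code. The chronic parameters are the figures quoted above; for the normal model only pain→no_pain and the emissions are quoted, so the no_pain transition row is an assumption, as is the uniform prior over the first hidden state. All function and variable names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

STATES = ("pain", "no_pain")

# Chronic model: transition and emission figures as quoted in the takeaways.
CHRONIC = {
    "trans": {
        "pain": {"pain": 0.8, "no_pain": 0.2},
        "no_pain": {"pain": 0.7, "no_pain": 0.3},
    },
    "emit_noxious": {"pain": 0.6, "no_pain": 0.6},
}

# Normal model: pain->no_pain = 0.7 and the emissions are quoted.
NORMAL = {
    "trans": {
        "pain": {"pain": 0.3, "no_pain": 0.7},
        "no_pain": {"pain": 0.1, "no_pain": 0.9},  # ASSUMPTION: row not given in the summary
    },
    "emit_noxious": {"pain": 0.8, "no_pain": 0.1},
}

UNIFORM_PRIOR = {"pain": 0.5, "no_pain": 0.5}  # ASSUMPTION: prior not given in the summary


def emission_prob(hmm: dict, state: str, obs: str) -> float:
    """Probability of a binary observation ('noxious' or 'harmless') in a state."""
    p_noxious = hmm["emit_noxious"][state]
    return p_noxious if obs == "noxious" else 1.0 - p_noxious


def pain_belief_trace(
    hmm: dict,
    observations: Sequence[str],
    prior: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Return Pr(H_t = pain | O_1:t) for every t using the normalized forward recursion."""
    prior = UNIFORM_PRIOR if prior is None else prior
    belief: Optional[Dict[str, float]] = None
    trace: List[float] = []
    for obs in observations:
        if belief is None:
            # First step: the prior is the distribution over H_1.
            predicted = dict(prior)
        else:
            # Prediction step: push the previous belief through the transitions.
            predicted = {
                s: sum(belief[r] * hmm["trans"][r][s] for r in STATES) for s in STATES
            }
        # Update step: weight by the emission likelihood, then normalize.
        unnorm = {s: predicted[s] * emission_prob(hmm, s, obs) for s in STATES}
        total = sum(unnorm.values())
        belief = {s: unnorm[s] / total for s in STATES}
        trace.append(belief["pain"])
    return trace


if __name__ == "__main__":
    obs = ["noxious"] * 3 + ["harmless"] * 3
    print("normal :", [round(b, 3) for b in pain_belief_trace(NORMAL, obs)])
    print("chronic:", [round(b, 3) for b in pain_belief_trace(CHRONIC, obs)])
```

One consequence of the chronic figures as quoted: both states emit noxious with the same probability, so observations carry no information and the belief follows the transition matrix alone, settling near 7/9 ≈ 0.78 in this sketch. Under the assumed normal parameters, the belief instead rises with noxious observations and falls with harmless ones.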

Peer brief — for seminar discussion

Petrowski and Gašić embed a Bayesian affective model inside a tabular Q-learning agent to test whether self-inferred internal states can serve as an intrinsic exploration signal, framing the exercise explicitly within the Theory of Mind literature's hypothesis that self- and other-awareness share a unified cognitive substrate. The technical contribution is an introspective exploration component built from a hidden Markov model—adapted directly from the normal and chronic pain HMMs of Eckert, Pabst, and Endres (2022)—that computes a real-time pain-belief Pr(Ht=pain | O1:t) via the forward algorithm and folds it as a penalty into a well-being reward function extending the happiness model of Dubey, Griffiths, and Dayan (2022). Experiments run in two 7×7 gridworld settings (stationary: 2500-step lifetime; non-stationary: 5000-step lifetime, food relocating every 1250 steps), with n=300 histories per condition and best agents selected from a grid search over 312,130 reward function variants. An alternative approach would have been to use a fully Bayesian POMDP formulation as in Mahajan, Dayan, and Seymour (2025) rather than fixing HMM parameters and wrapping them around an ε-greedy Q-learner, which would have permitted joint inference over both environment and internal state. The load-bearing finding is that introspective agents significantly outperform no-pain baselines across all seven reward categories and both environments (p ≪ 0.05 on one-sided paired-samples t-tests). In the non-stationary 'Objective+Expect' category, the chronic pain agent achieves M=4214.6 (SD=165.4) versus the normal agent's M=3814.0 (SD=446.6) and the no-pain baseline's M=2371.0 (SD=613.3). 
Critically, the chronic model's edge comes with a consistently negative cumulative well-being: momentary well-being reaches approximately zero only upon food discovery and drops immediately on losing access, a pattern the paper maps computationally onto negative reinforcement in addiction via Koob and Le Moal (2008). The normal model instead functions as a low-pass filter over the happiness signal, providing a stable exploration bonus that keeps cumulative well-being positive. The paper predicts that this self-directed architecture is the first half of a system that, once extended to infer others' pain states, would constitute a full computational ToM. A critical reader should push back on the generalization scope: both environments are 7×7 grids with a single reward source and a single agent, and the HMM parameters are fixed borrowings from a clinical pain model rather than learned or validated against behavioral data. The chronic pain agent's outperformance is also fragile—it is the worst performer across the full distribution of grid-search agents in its category (not just the selected best), meaning the reported gain reflects a narrow hyperparameter regime rather than a robust property of chronic-type inference. The ε-greedy baseline is unusually weak for 2026, and the paper acknowledges this; a count-based or curiosity-driven intrinsic motivation baseline would have sharpened the claim that pain-belief specifically, rather than any non-stationary bonus, drives the gains.

Methods (3)

Non-Stationary Gridworld Environment
7×7 gridworld where food changes location to another corner every 1250 steps; agent lifetime 5000 steps
Stationary Gridworld Environment
7×7 gridworld where food state does not change position during agent lifetime of 2500 steps
Well-Being Function f[w]
Extended subjective reward function proposed in this paper combining happiness with pain-belief signal
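
As a rough illustration of how a well-being signal could drive the ε-greedy Q-learner mentioned in the summary, the sketch below subtracts a weighted pain-belief from a happiness value and feeds the result into a standard tabular Q-learning update. The exact form of f[w] is not given on this page, so the subtraction `happiness - w4 * pain_belief` is an assumption, as is the pairing of w4 with the pain term; all function names are hypothetical. Only γ=0.99 and the use of ε-greedy Q-learning come from the source.

```python
import random
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

GAMMA = 0.99  # discount factor quoted as fixed across all agents

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def well_being(happiness: float, pain_belief: float, w4: float) -> float:
    """ASSUMED form of f[w]: happiness minus a weighted pain-belief penalty."""
    return happiness - w4 * pain_belief


def epsilon_greedy(
    q: QTable, state: Hashable, actions: Sequence[Hashable],
    epsilon: float, rng: random.Random,
) -> Hashable:
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return max(actions, key=lambda a: q[(state, a)])


def q_update(
    q: QTable, state: Hashable, action: Hashable, reward: float,
    next_state: Hashable, actions: Sequence[Hashable],
    alpha: float, gamma: float = GAMMA,
) -> float:
    """Apply one tabular Q-learning update in place and return the TD error."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


if __name__ == "__main__":
    q: QTable = defaultdict(float)
    actions = ["up", "down", "left", "right"]
    rng = random.Random(0)
    # One illustrative step: the subjective reward, not the objective one, trains Q.
    reward = well_being(happiness=1.0, pain_belief=0.5, w4=0.4)
    q_update(q, (0, 0), "up", reward, (0, 1), actions, alpha=0.5)
    print(epsilon_greedy(q, (0, 0), actions, epsilon=0.1, rng=rng))
```

The design point the sketch tries to capture is that the agent learns from the subjective signal while performance is scored on cumulative objective reward, which is why a penalty term can still raise the objective score.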

Frameworks (2)

Introspective Exploration Component
The novel framework introduced in the paper: an HMM-based pain-belief signal integrated into the reward function to drive exploration
Optimal Reward Framework
Framework from Singh, Lewis, and Barto 2009 used to select best-performing reward functions via grid search
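
The selection step can be sketched as an exhaustive search over the hyperparameter ranges quoted in takeaway 7, keeping the configuration with the highest score. This is a minimal sketch under stated assumptions: `toy_evaluate` is a hypothetical stand-in for running an agent over many histories and averaging its cumulative objective reward, and the dictionary keys are illustrative names only.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Iterator

# Ranges as quoted in the summary.
WEIGHTS = (0, 0.1, 0.3, 0.5, 0.7, 0.9, 1)
ASPIRATIONS = (0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 1)
ALPHAS = (0.1, 0.3, 0.5, 0.7, 0.9)
EPSILONS = (0.01, 0.1)
GAMMA = 0.99

Config = Dict[str, float]


def candidate_configs() -> Iterator[Config]:
    """Yield the raw Cartesian product of the quoted ranges (no deduplication)."""
    for w1, w2, w3, w4, rho, alpha, eps in product(
        WEIGHTS, WEIGHTS, WEIGHTS, WEIGHTS, ASPIRATIONS, ALPHAS, EPSILONS
    ):
        yield {
            "w1": w1, "w2": w2, "w3": w3, "w4": w4,
            "rho": rho, "alpha": alpha, "epsilon": eps, "gamma": GAMMA,
        }


def select_best(configs: Iterable[Config], evaluate: Callable[[Config], float]) -> Config:
    """Return the configuration with the highest evaluation score."""
    return max(configs, key=evaluate)


def toy_evaluate(config: Config) -> float:
    """HYPOTHETICAL score; a real run would average cumulative objective reward."""
    return -abs(config["w4"] - 0.5) - abs(config["alpha"] - 0.3)


if __name__ == "__main__":
    print(sum(1 for _ in candidate_configs()))
    print(select_best(candidate_configs(), toy_evaluate))
```

The raw product of the quoted ranges is 7⁴ · 8 · 5 · 2 = 192,080 configurations, which does not equal the 312,130 figure reported above. This page does not say how that count is composed, so the sketch makes no attempt to reproduce it.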

Findings (8)

Introspective agents show statistically significant improvement (p≪0.05) over no-pain baselines across most reward categories and both environments
Main empirical result of the paper establishing general superiority of introspective agents
In the non-stationary Objective-only category, the best normal agent (α=0.9) and the best chronic agent (α=0.1) use opposite learning rates
Suggests fundamental differences in learning dynamics between normal and chronic perception models
Chronic pain agent achieves a cumulative objective reward (COR) of M=4235.5, SD=180.3 in the non-stationary All category (n=300), the highest across all chronic results
Peak performance of chronic pain agents across all reward categories in non-stationary environment
Chronic pain agent accumulates negative cumulative well-being across its entire lifetime in non-stationary environment
Key behavioral signature of chronic model paralleling human chronic pain experience
Normal pain agent maintains mostly positive cumulative well-being and recovers before finding food after change
Contrasts with chronic agent; normal model provides stable exploration bonus without addiction-like dynamics
No-pain baseline achieves M=1586.5, SD=631.2 COR in non-stationary Objective-only category (n=300)
Baseline for non-stationary Objective-only; dramatically lower than both pain models
Grid search covers 312,130 subjective reward functions per environment after removing duplicates
Scale of the hyperparameter search establishing thoroughness of optimization
Chronic pain agent's momentary well-being recovers to approximately zero only when visiting the food state
Demonstrates relief-seeking behavior pattern analogous to addiction in the chronic agent

Claims (8)

The chronic agent's high performance despite negative well-being aligns with findings on chronic pain and quality of life in humans
Cross-domain interpretive claim linking computational results to human chronic pain literature
Traditional RL frameworks optimize externally defined reward functions lacking representational depth for mental-state reasoning
Motivation claim positioning this paper against standard RL approaches
The chronic pain model outperforms the normal pain model in non-stationary environments despite producing negative well-being
Surprising finding that maladaptive perception can yield superior task performance in changing environments
Self-awareness via pain-belief inference enhances adaptation and generates psychologically plausible dynamics in RL agents
Main interpretive conclusion of the paper
The chronic pain agent's relief-seeking cycle provides a computational parallel to negative reinforcement in addiction
Authors' psychological interpretation of chronic agent behavior as analogous to addiction dynamics
Introspective agents generally outperform standard no-pain baseline agents across environments and reward categories
Central empirical claim of the paper supported by statistical tests
The proposed framework models the self-application aspect of the unified ToM system
Authors' claim that introspective inference is one half of the unified ToM system and can be extended to other-inference
The normal pain model acts as a low-pass filter over the happiness signal
Authors' interpretation that the normal pain model smooths the happiness signal into a stable belief state providing an exploration bonus

Hypotheses (1)

Future work can test the unified ToM system by extending the architecture to infer others' states
Forward-looking predictive claim about extending the framework to other-awareness

Questions (3)

Should an aversive signal be operationalized as direct environmental feedback or as a latent state the agent must infer?
Design question answered in the paper by choosing latent inference over direct feedback
Can the same inferential architecture that supports self-awareness also support inference of others' mental states?
Core open question motivating the future work direction of the paper
Why does a maladaptive chronic pain perception outperform normal pain in non-stationary environments?
Empirical puzzle raised by the surprising chronic model results

Original abstract

Understanding how artificial agents model internal mental states is central to advancing Theory of Mind in AI. Evidence points to a unified system for self- and other-awareness. We explore this self-awareness by having reinforcement learning agents infer their own internal states in gridworld environments. Specifically, we introduce an introspective exploration component that is inspired by biological pain as a learning signal by utilizing a hidden Markov model to infer "pain-belief" from online observations. This signal is integrated into a subjective reward function to study how self-awareness affects the agent's learning abilities. Further, we use this computational framework to investigate the difference in performance between normal and chronic pain perception models. Results show that introspective agents in general significantly outperform standard baseline agents and can replicate complex human-like behaviors.

Related work

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
in corpus
2026
≈ 84%
Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM
Francesca Bianco and Derek Shiller
2026
≈ 82%
Intrinsic Rewards for Exploration without Harm from Observational Noise: A Simulation Study Based on the Free Energy Principle
Kenji Doya, Jun Tani, Theodore Jerome Tinker
2024
≈ 82%
Emergent Introspective Awareness in Large Language Models
in corpus
2026
≈ 82%
Why Learning Requires Feeling
in corpus
2026
≈ 82%
Persistence and Introspection of Emotion Features
in corpus
≈ 81%
Probing for Consciousness in Machines
Achim Schilling, Andreas Maier, Patrick Krauss, Mathis Immertreu
2024
≈ 81%
Reinforcement Learning with Exogenous States and Rewards
George Trimponias and Thomas G. Dietterich
2026
≈ 81%
Quantifying Potential Observation Missingness in Inverse Reinforcement Learning
Abhishek Sharma, Alihan Huyuk, Finale Doshi-Velez, Leo Benac
2026
≈ 81%
Inverse Rational Control: Inferring What You Think from How You Forage
Paul Schrater, Xaq Pitkow, Zhengwei Wu
2019
≈ 81%
Active inference on discrete state-spaces: a synthesis
in corpus
2020
≈ 81%
Large Language Models Report Subjective Experience Under Self-Referential Processing
in corpus
2025
≈ 81%
Observer, Not Player: Simulating Theory of Mind in LLMs through Game Observation
Ting Yiu Liu, Jerry Wang
2025
≈ 80%
Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
in corpus
2025
≈ 80%
A mathematical model of reward-mediated learning in drug addiction
Tom Chou and Maria D'Orsogna
2026
≈ 80%
The Physical Basis of Prediction: World Model Formation in Neural Organoids via an LLM-Generated Curriculum
Brennen Hill
2025
≈ 80%
Life as we know it
in corpus
2013
≈ 80%
Online reinforcement learning with sparse rewards through an active inference capsule
Charel van Hoof (1), Beren Millidge (2), Alejandro Daniel Noel (1) ((1) Delft University of Technology, (2) University of Oxford)
2021
≈ 80%
Learning in embodied action-perception loops through exploration
Daniel Y. Little and Friedrich T. Sommer
2011
≈ 80%
Same World, Differently Given: History-Dependent Perceptual Reorganization in Artificial Agents
Hongju Pae
2026
≈ 80%
Self-Monitoring Benefits from Structural Integration: Lessons from Metacognition in Continuous-Time Multi-Timescale Agents
Ying Xie
2026
≈ 80%
Contemplative Agent
in corpus
2025
≈ 80%
Learning mental states estimation through self-observation: a developmental synergy between intentions and beliefs representations in a deep-learning model of Theory of Mind
Silvia Rigato, Maria Laura Filippetti, Dimitri Ognibene, Francesca Bianco
2024
≈ 80%
Learning Dynamic Belief Graphs for Theory-of-mind Reasoning
Xilei Zhao, Thomas J. Cova, Frank A. Drews, Susu Xu, Ruxiao Chen
2026
≈ 80%
Active Inference, Curiosity and Insight
in corpus
2017
≈ 79%
A Free energy principle for the brain (lecture summary)
in corpus
2008
≈ 79%
Active Inference with a Self-Prior in the Mirror-Mark Task
in corpus
2026
≈ 79%
There is no self-evidence: A physics of emptiness realisation
in corpus
2026
≈ 79%
A tutorial on hidden Markov models and selected applications in speech recognition
cited
1989
≈ 73%
Does the chimpanzee have a theory of mind?
cited
1978
≈ 70%
