book

active

book:sutton-and-barto-2018

Sutton and Barto 2018

Standard reinforcement learning (RL) textbook, cited as the reference for traditional reward-function optimization

Extracted from this book

Claims (8)

Introspective agents generally outperform standard no-pain baseline agents across environments and reward categories
Central empirical claim of the paper supported by statistical tests
The normal pain model acts as a low-pass filter
Author's interpretation that the normal pain model smooths the happiness signal into a stable belief state that provides an exploration bonus
Self-awareness via pain-belief inference enhances adaptation and generates psychologically plausible dynamics in RL agents
Main interpretive conclusion of the paper
The chronic agent's high performance despite negative well-being aligns with findings on chronic pain and quality of life in humans
Cross-domain interpretive claim linking computational results to human chronic pain literature
The chronic pain agent's relief-seeking cycle provides a computational parallel to negative reinforcement in addiction
Author's psychological interpretation of chronic agent behavior as analogous to addiction dynamics
The chronic pain model outperforms the normal pain model in non-stationary environments despite producing negative well-being
Surprising finding that maladaptive perception can yield superior task performance in changing environments
The proposed framework models the self-application aspect of the unified Theory of Mind (ToM) system
Author's claim that introspective inference is one half of the unified ToM system and can be extended to other-inference
Traditional RL frameworks optimize externally defined reward functions lacking representational depth for mental-state reasoning
Motivation claim positioning this paper against standard RL approaches
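
The low-pass-filter claim above can be illustrated with a minimal sketch. The source does not give the filter's form, so this assumes a first-order exponential moving average; the function name low_pass, the smoothing parameter, and the toy happiness signal are all hypothetical and are not taken from the paper. The smoothing coefficient here is also distinct from the learning rate α reported in the findings.

```python
def low_pass(signal, smoothing=0.1, initial=0.0):
    """Smooth a sequence with a first-order exponential moving average.

    Assumption: this stands in for the belief update described in the
    claim; the paper's actual pain model may differ.
    """
    belief = initial
    smoothed = []
    for value in signal:
        # Move the belief a small step toward the newest observation.
        belief += smoothing * (value - belief)
        smoothed.append(belief)
    return smoothed


# Toy happiness signal that flips sign on every step (hypothetical data).
noisy_happiness = [1.0, -1.0] * 50
belief_state = low_pass(noisy_happiness, smoothing=0.1)

# The raw signal swings between -1 and 1; the belief stays close to 0.
print(max(abs(b) for b in belief_state))
```

A smaller smoothing value yields a more stable belief that reacts more slowly to real changes in the signal, which is the usual trade-off for a filter of this kind.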

Findings (8)

Chronic pain agent accumulates negative cumulative well-being across its entire lifetime in the non-stationary environment
Key behavioral signature of chronic model paralleling human chronic pain experience
Chronic pain agent achieves M=4235.5, SD=180.3 COR in non-stationary All category (n=300), highest across all chronic results
Peak performance of chronic pain agents across all reward categories in non-stationary environment
Chronic pain agent's momentary well-being recovers to zero only when visiting the food state
Demonstrates relief-seeking behavior pattern analogous to addiction in the chronic agent
Grid search covers 312,130 subjective reward functions per environment after removing duplicates
Scale of the hyperparameter search establishing thoroughness of optimization
Introspective agents show statistically significant improvement (p≪0.05) over no-pain baselines across most reward categories and both environments
Main empirical result of the paper establishing general superiority of introspective agents
No-pain baseline achieves M=1586.5, SD=631.2 COR in non-stationary Objective-only category (n=300)
Baseline for non-stationary Objective-only; dramatically lower than both pain models
Normal (α=0.9) and chronic (α=0.1) agents in the non-stationary Objective-only category perform best with opposite learning rates
Suggests fundamental differences in learning dynamics between normal and chronic perception models
Normal pain agent maintains mostly positive cumulative well-being and recovers before finding food after the environment changes
Contrasts with chronic agent; normal model provides stable exploration bonus without addiction-like dynamics

Hypotheses (1)

Future work can test the unified ToM system by extending the architecture to infer others' states
Forward-looking predictive claim about extending the framework to other-awareness

Neighborhood — ranked by edge-count

Papers (1)

paper

Exploration Through Introspection: A Self-Aware Reward Model
cites