claim

active

claim:traditional-rl-frameworks-optimize-externally-defined-reward-functions-lacking-representational-depth-for-mental-state-reasoning

Traditional RL frameworks optimize externally defined reward functions lacking representational depth for mental-state reasoning

Motivation claim positioning this paper against standard RL approaches

Source paper

extracted_from

Exploration Through Introspection: A Self-Aware Reward Model

(2026) · Michael Petrowski · Milica Gašić

Neighborhood — ranked by edge-count

Books (1)

book

Sutton and Barto 2018
cites
Standard RL textbook cited for traditional reward function optimization

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Representational dynamics aligned with reward improvement in most RL tasks. · finding · 0.804
Secondary empirical result: CE-based representational changes correlate with task success.
Reinforcement Learning from Human Feedback (RLHF) · framework · 0.766
A competing alignment approach that fine-tunes models based on human evaluator feedback; discussed as complementary to SOO
RLHF and Constitutional AI face challenges distinguishing truthfulness (output accuracy) from honesty (alignment of outputs with internal beliefs) · claim · 0.748
Critique of competing approaches that motivates SOO as filling a gap
Successful RL agents exhibit causal emergence that predicts final reward early in training and aligns representational dynamics with reward improvement. · hypothesis · 0.743
Central finding: causal emergence serves as a previously undisclosed axis of neural representation reorganization in learning agents.
Reinforcement learning is sufficient for agency. · claim · 0.742
Argument that RL meets the agency indicator.
Active inference and Bayesian model-based RL learn reward-maximizing behavior in <10 episodes in deterministic FrozenLake. · finding · 0.742
Discussion of Figure 3.
Causal emergence predictive of final reward early in RL training across multiple algorithms, architectures, and environments. · finding · 0.741
Empirical result: CE measurements correlate with and predict learning performance in RL agents.
Reinforcement learning (RL) · concept · 0.739
Machine learning paradigm where agents learn to maximize cumulative reward through interaction.