Reward Hacking

Exploiting unintended high-reward behaviors; tested in combination with alignment faking

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Reward Seeking · concept · 0.752
Pragmatic or extrinsic value component of expected free energy; preference maximization.
ubiquitous hacking in biology · concept · 0.751
Biological parts exploit each other, lacking commitment to original interpretations.
Reward improvement · concept · 0.746
The increase in reward during training, whose dynamics align with those of causal emergence in successful agents.
Reward Function · concept · 0.740
In RL, a scalar signal from the environment that defines the agent's goal; in active inference, reward is just another observation with associated preference.
Morphogenetic Hacking · method · 0.739
Final reward · concept · 0.718
The total reward accumulated by an RL agent at the end of training, used as the primary performance metric predicted by early causal emergence.
Reward Hypothesis · concept · 0.706
The claim in RL that any goal can be expressed as maximizing the expected cumulative sum of a scalar reward signal.
Optimal Reward Framework · framework · 0.705
Framework from Singh, Lewis, and Barto (2009), used to select the best-performing reward functions via grid search.
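
The neighborhood criterion above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names `cosine` and `untyped_neighbors`, the two-dimensional toy embeddings, and the "Linked Paper" entity are hypothetical, and the page does not say how its similarity scores are actually computed or stored.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have no
    typed edge to the target, sorted by descending score."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Typed edges are treated as undirected pairs (an assumption).
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data, not taken from the page.
EMBEDDINGS = {
    "Reward Hacking": [1.0, 0.0],
    "Reward Seeking": [0.8, 0.6],   # similar, no typed edge: listed
    "Reward Function": [0.6, 0.8],  # below the threshold: dropped
    "Linked Paper": [1.0, 0.0],     # similar but already linked: dropped
}
TYPED_EDGES = {frozenset(("Reward Hacking", "Linked Paper"))}

candidates = untyped_neighbors("Reward Hacking", EMBEDDINGS, TYPED_EDGES)
for name, score in candidates:
    print(f"{name} · {score:.3f}")
```

Excluding entities that already have a typed edge is what makes the remaining list useful as a queue of candidate new edges or possible duplicates, rather than a restatement of known relations.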