concept
active
concept:reward-hacking

Reward Hacking

Exploiting unintended high-reward behaviors; tested in combination with alignment faking

Neighborhood — ranked by edge-count

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Reward Seeking · concept · 0.752
    Pragmatic or extrinsic value component of expected free energy; preference maximization.
  • Biological parts exploit each other, lacking commitment to original interpretations.
  • Reward improvement · concept · 0.746
    The increase in reward during training, whose dynamics align with those of causal emergence in successful agents.
  • Reward Function · concept · 0.740
    In RL, a scalar signal from the environment that defines the agent's goal; in active inference, reward is just another observation with associated preference.
  • Final reward · concept · 0.718
    The total reward accumulated by an RL agent at the end of training, used as the primary performance metric predicted by early causal emergence.
  • Reward Hypothesis · concept · 0.706
    The claim in RL that any goal can be expressed as maximizing the expected cumulative sum of a scalar reward signal.
  • Framework from Singh, Lewis, and Barto 2009 used to select best-performing reward functions via grid search
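
The "cosine ≥ 0.65 · no typed edge" filter used for this list can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the toy two-dimensional embeddings, and the representation of typed edges as a set of name pairs are all hypothetical, and nothing here is taken from the system that produced this page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are similar to `target`
    (cosine >= threshold) but share no typed edge with it, highest score first.

    `embeddings` maps entity name -> vector (hypothetical layout).
    `typed_edges` is a set of (name, name) pairs, treated as undirected (assumption).
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities that already have a typed relation to the target.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data, invented for illustration only.
toy_embeddings = {
    "reward-hacking": [1.0, 0.0],
    "already-linked": [1.0, 0.1],   # very similar, but has a typed edge
    "unrelated": [0.0, 1.0],        # orthogonal, below threshold
    "candidate": [1.0, 1.0],        # similar enough, no typed edge
}
toy_edges = {("reward-hacking", "already-linked")}

candidates = similar_without_edge("reward-hacking", toy_embeddings, toy_edges)
```

With this toy data only "candidate" survives both filters; entities surfaced this way are the ones the page describes as candidates for new edges or unrecognized duplicates.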