paper
active
2026
paper:pigozzi-levin-causally-emergent-alignment-2026

The Causally Emergent Alignment Hypothesis: Causal Emergence Aligns with and Predicts Final Reward in Reinforcement Learning Agents

TL;DR

Causal emergence in the latent-space representations of reinforcement learning agents is consistently predictive of final reward and aligns dynamically with reward improvement across training, a finding Pigozzi and Levin formalize as the Causally Emergent Alignment Hypothesis. The authors apply the recently proposed ΦID (Integrated Information Decomposition) to neural-network agents' latent states over their full training lifetimes; causal emergence scores measured early in training predicted end-of-training reward across six environments spanning a complexity spectrum and across multiple RL algorithms and agent architectures. Because the estimator is applied to latent-space dynamics, it allows trajectory-level comparison between representational reorganization and reward signals. The alignment parallels a known biological phenomenon: biological agents, even minimal ones, increase their causal emergence after learning new memories, and the same axis of representational reorganization appears to operate in artificial agents. The paper argues that causal emergence is a previously undisclosed axis along which neural representations reorganize during RL training. This suggests that interventions targeting causal emergence directly, rather than reward signals alone, may offer mechanistically grounded routes to better RL agents, and it identifies a structural axis along which biological and artificial cognition can be compared.

What to take away

  1. Successful RL agents exhibit causal emergence in their latent-space representations that is predictive of final reward early in training, before convergence, across six environments of varying complexity.
  2. The paper uses the recently proposed ΦID (Integrated Information Decomposition) as its estimator of causal emergence, applied to the trajectory of an agent's latent-space representations over its full training lifetime.
  3. Causal emergence dynamics align with reward improvement in most, though not all, of the tested tasks, so the relationship is robust but not universal across the complexity spectrum.
  4. The study spans multiple RL algorithms and agent architectures, so the Causally Emergent Alignment Hypothesis is architecture-agnostic within the tested conditions rather than specific to a single model family.
  5. Biological agents, including minimal ones, are known to increase causal emergence after learning new memories, and this paper provides evidence that artificial RL agents follow an analogous trajectory, drawing a direct parallel between the two phenomena.
  6. Causal emergence is operationalized as the degree to which an agent exerts unique predictive power on its own future, here measured on its latent state, which distinguishes it from generic mutual-information or representational-complexity measures.
  7. The Causally Emergent Alignment Hypothesis predicts that interventions targeting causal emergence directly, as opposed to reward shaping, could produce mechanistically grounded improvements in RL agent performance.
  8. An open question the paper raises is whether causal emergence can be used not merely as a diagnostic correlate of learning success but as a causal lever: it remains untested whether maximizing ΦID-measured causal emergence as an auxiliary objective accelerates learning or improves final reward.
  9. To replicate the core measurement, a researcher would compute ΦID on the sequence of latent-state vectors produced by an RL agent at regular checkpoints throughout training, then correlate the resulting causal-emergence trajectory with the concurrent reward curve.
  10. The six environments were arranged on an explicit complexity spectrum. The abstract reports that causal emergence aligned with reward improvement in most tasks but does not say which ones, so the suggestion that the signal is strongest at intermediate to high complexity should be checked against the full paper.
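
The correlation step in the replication takeaway above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes causal-emergence scores have already been computed by some ΦID implementation and stored in a hypothetical array `emergence` of shape (agents, checkpoints). The choice of Spearman rank correlation, the function names, and the toy numbers are all assumptions made for the example.

```python
import numpy as np

def rank(values):
    """Ordinal ranks (0 = smallest). Ties are broken by position, not averaged."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    return float(np.corrcoef(rank(x), rank(y))[0, 1])

def early_predictiveness(emergence, final_reward):
    """Correlate emergence at each checkpoint with final reward across agents.

    emergence: shape (n_agents, n_checkpoints), hypothetical precomputed scores.
    final_reward: shape (n_agents,), end-of-training reward per agent.
    Returns one correlation per checkpoint.
    """
    emergence = np.asarray(emergence, dtype=float)
    return np.array([spearman(emergence[:, t], final_reward)
                     for t in range(emergence.shape[1])])

# Toy numbers, invented only to exercise the functions (not results from the paper).
toy_emergence = np.array([[0.1, 0.2, 0.3],
                          [0.2, 0.4, 0.6],
                          [0.3, 0.6, 0.9],
                          [0.4, 0.8, 1.2]])
toy_final_reward = np.array([10.0, 20.0, 30.0, 40.0])
per_checkpoint_rho = early_predictiveness(toy_emergence, toy_final_reward)
```

The same `spearman` helper covers the within-agent comparison: pass one agent's emergence trajectory and its reward curve sampled at the same checkpoints. A rank correlation is used here only because it is insensitive to the scale of the two signals.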

Peer brief — for seminar discussion

Pigozzi and Levin train neural-network reinforcement learning agents across six environments arranged on an explicit complexity spectrum, varying both RL algorithm and agent architecture, and at each training checkpoint compute the causal emergence of the agents' latent-space representations using ΦID (Integrated Information Decomposition) as the estimator. The central finding, which they frame as the Causally Emergent Alignment Hypothesis, is that causal emergence measured early in training is consistently predictive of final reward, and that the temporal dynamics of causal emergence align with periods of reward improvement in most of the six environments. This makes causal emergence a leading indicator of learning success rather than merely a post-hoc descriptor of well-trained representations. The biological grounding matters to the argument: prior empirical work has shown that biological agents, even minimal ones, increase their causal emergence after learning new memories, and the paper positions the RL finding as a structural parallel, advancing Levin's broader 'diverse intelligence' program by identifying a shared representational axis between biological and artificial agents. The practical implication is pointed: if causal emergence tracks and predicts learning quality, it could serve as a target for intervention (through auxiliary objectives or architectural constraints) rather than only as a diagnostic, potentially yielding more capable agents through mechanistically principled design rather than reward engineering alone. One alternative method is representational similarity analysis (RSA) on the same latent trajectories, which would offer a more established baseline for assessing representational reorganization, though it lacks the causally directed framing that ΦID provides.
The most contestable element is the inferential leap from correlation to mechanism. The paper shows that ΦID-measured causal emergence covaries with reward improvement across multiple algorithms and in most of the six environments, but it performs no intervention: no experiment confirms that raising causal emergence causally produces higher reward, so the hypothesis, however well supported correlationally, remains a hypothesis in the strict sense. A critical reader would also push back on the generalizability claim, since 'six environments on a complexity spectrum' is still a narrow substrate: all six are presumably standard RL benchmarks, and it is unclear whether the alignment holds in environments with sparse reward, non-Markovian dynamics, or continuous high-dimensional action spaces. The paper's own prediction, that interventions targeting ΦID-measured causal emergence will produce better RL agents, is explicit and falsifiable, making it a direct agenda item for follow-up empirical work.
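
As a point of reference for the measurement discussed in this brief, ΦID is generally described as decomposing the time-delayed mutual information between a system's past and future into finer-grained information atoms. The sketch below computes only that undecomposed total under a joint-Gaussian assumption. It is not the paper's causal-emergence estimator, and the function names, the lag of one step, and the synthetic AR(1) data are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_mutual_information(x, y):
    """Mutual information in nats between x and y, assuming joint Gaussianity.

    x: shape (n_samples, dx); y: shape (n_samples, dy).
    Uses I = 0.5 * (logdet(Cov_x) + logdet(Cov_y) - logdet(Cov_xy)).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x.shape[1]
    cov = np.cov(np.hstack([x, y]), rowvar=False)
    _, logdet_x = np.linalg.slogdet(cov[:dx, :dx])
    _, logdet_y = np.linalg.slogdet(cov[dx:, dx:])
    _, logdet_xy = np.linalg.slogdet(cov)
    return 0.5 * (logdet_x + logdet_y - logdet_xy)

def time_delayed_mi(latents, lag=1):
    """MI between latent states at time t and at time t + lag.

    latents: shape (n_steps, n_dims), one latent vector per time step.
    """
    latents = np.asarray(latents, dtype=float)
    return gaussian_mutual_information(latents[:-lag], latents[lag:])

# Synthetic stand-ins for latent trajectories (not data from the paper).
rng = np.random.default_rng(0)
white = rng.standard_normal((5000, 2))   # no temporal structure
ar = np.zeros_like(white)                # AR(1): each step depends on the last
for t in range(1, len(white)):
    ar[t] = 0.9 * ar[t - 1] + white[t]

ar_mi = time_delayed_mi(ar)
white_mi = time_delayed_mi(white)
```

On these synthetic trajectories the autoregressive series carries far more information about its own next step than the white-noise series does. A causal-emergence estimate asks the narrower question of how much of that predictive information is attributable to the whole rather than to the parts, which this total alone cannot answer.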

Frameworks (1)

  • Reinforcement Learning
    Alternative framework for agent behavior; based on reward maximization rather than free energy minimization.

Original abstract

A hallmark of life on Earth is the ability of agents to exert causal power and be drivers of subsequent events. This is key to cognition at all scales. Causal emergence, measuring the degree to which an agent exerts unique predictive power on its future, is one consequence of causal power. Indeed, recent discoveries have shown that biological agents, even minimal ones, increase their causal emergence after learning new memories. However, there is a major knowledge gap regarding how causally emergent artificial agents are. We focused on Reinforcement Learning (RL) of neural-network agents across an array of environmental conditions, encompassing different algorithms, agent architectures, and six environments arranged on a complexity spectrum. For consistency, we computed the causal emergence of their latent-space representations over their lifetimes. We used the recently proposed ΦID to estimate causal emergence and tested how it related to learning performance. Our results suggested a Causally Emergent Alignment Hypothesis: successful agents exhibited causal emergence that was consistently predictive of final reward early in training and whose representational dynamics aligned with reward improvement in most tasks. This idea suggests that causal emergence may be a previously undisclosed axis of reorganization of neural representations in RL agents, with the potential to establish causal relationships and interventions that will lead to better RL agents. Our work also highlights the alignment between causal emergence and learning as another way biological and artificial creatures compare.
