The Causally Emergent Alignment Hypothesis: Causal Emergence Aligns with and Predicts Final Reward in Reinforcement Learning Agents

By Federico Pigozzi · Michael Levin (Allen Discovery Center at Tufts University)

DOI 10.48550/arxiv.2605.06746 arXiv 2605.06746 OpenAlex W7160819604

Active Inference; Care, cognition, and living systems; Bioelectric cognition & collective individuality; Agent architectures; Reinforcement Learning; ΦID-based estimation of causal emergence in RL latent dynamics; Artificial agents; Biological agents; Causal Emergence; Complexity spectrum of environments; Environmental conditions; Final reward; ΦID (Integrated Information Decomposition); Latent-Space Representations; Learning performance; Neural-network agents

TL;DR

Causal emergence in the latent-space representations of reinforcement learning agents is consistently predictive of final reward early in training, and its dynamics align with reward improvement in most tasks, a finding Pigozzi and Levin formalize as the Causally Emergent Alignment Hypothesis. Causal emergence was estimated with the recently proposed ΦID (Integrated Information Decomposition), applied to neural-network agents' latent states over their full training lifetimes; scores captured early in training predicted end-of-training reward across six environments spanning a complexity spectrum and across multiple RL algorithms and agent architectures. Applying the estimator to latent-space dynamics enables trajectory-level comparison between representational reorganization and reward signals. This alignment parallels a known biological phenomenon: biological agents, even minimal ones, increase their causal emergence after learning new memories, and the same axis of representational reorganization appears to operate in artificial agents. The paper argues that causal emergence may be a previously undisclosed dimension of how neural representations reorganize during RL training. This suggests that interventions targeting causal emergence directly, rather than reward signals alone, may offer mechanistically grounded routes to better RL agents, and it identifies a structural axis along which biological and artificial cognition can be compared.

What to take away

1. Successful RL agents exhibit causal emergence in their latent-space representations that is predictive of final reward early in training, before convergence, across six environments of varying complexity.
2. The paper uses the recently proposed ΦID (Integrated Information Decomposition) as the estimator for causal emergence, applied specifically to the trajectory of an agent's latent-space representations over its full training lifetime.
3. Causal emergence dynamics align with reward improvement in most — though not all — of the six tested environments, indicating the relationship is robust but not universal across the complexity spectrum.
4. The study spans multiple RL algorithms and agent architectures, making the Causally Emergent Alignment Hypothesis architecture-agnostic within the tested conditions rather than specific to a single model family.
5. Biological agents, including minimal ones, are known to increase causal emergence after learning new memories, and this paper provides evidence that artificial RL agents exhibit an analogous trajectory, directly linking the two phenomena.
6. Causal emergence is operationalized as the degree to which an agent's latent state exerts unique predictive power on its own future, distinguishing it from generic mutual-information or representational complexity measures.
7. The Causally Emergent Alignment Hypothesis predicts that interventions targeting causal emergence directly — as opposed to reward shaping — could produce mechanistically grounded improvements in RL agent performance.
8. An open question the paper raises is whether causal emergence can be used not merely as a diagnostic correlate of learning success but as a causal lever: it remains untested whether maximizing ΦID-measured causal emergence as an auxiliary objective accelerates or improves final reward.
9. To replicate the core measurement, a researcher would compute ΦID on the sequence of latent-state vectors produced by a trained RL agent at regular checkpoints throughout training, then correlate the resulting causal-emergence trajectory with the concurrent reward curve.
10. The six environments were arranged on an explicit complexity spectrum, and the alignment between causal emergence and reward reportedly held most consistently in tasks of intermediate to high complexity, suggesting the signal is most informative where representational demands are non-trivial.
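
The comparison described in item 9 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the per-checkpoint causal-emergence scores have already been computed by some estimator, and the function names (`pearson`, `trajectory_alignment`) and the synthetic curves are hypothetical. It reports two correlations, one on the raw trajectories and one on their checkpoint-to-checkpoint changes, because two curves that both rise over training can correlate in level without moving together step by step.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation of two 1-D sequences; NaN if either is constant."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else float("nan")


def trajectory_alignment(ce_scores, rewards):
    """Correlate a causal-emergence trajectory with the concurrent reward curve.

    Both inputs hold one value per training checkpoint, in training order.
    """
    ce = np.asarray(ce_scores, dtype=float)
    rw = np.asarray(rewards, dtype=float)
    if ce.ndim != 1 or ce.shape != rw.shape or ce.size < 3:
        raise ValueError("need two 1-D sequences of equal length >= 3")
    return {
        "level": pearson(ce, rw),                     # do the curves track each other?
        "change": pearson(np.diff(ce), np.diff(rw)),  # do they move at the same time?
    }


# Synthetic stand-in data (assumption): a saturating reward curve and a
# causal-emergence curve that follows it with a small oscillation.
checkpoints = np.arange(20)
reward_curve = 1.0 - np.exp(-checkpoints / 5.0)
ce_curve = 0.5 * reward_curve + 0.05 * np.sin(checkpoints)
alignment = trajectory_alignment(ce_curve, reward_curve)
```

On real runs, `ce_curve` would come from the chosen causal-emergence estimator evaluated at each checkpoint, and `reward_curve` from the evaluation reward logged at the same checkpoints.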

Peer brief — for seminar discussion

Pigozzi and Levin train neural-network reinforcement learning agents across six environments arranged on an explicit complexity spectrum, varying both RL algorithm and agent architecture, and at each training checkpoint compute the causal emergence of the agents' latent-space representations using ΦID — Integrated Information Decomposition — as the estimator. The central finding, which they crystallize as the Causally Emergent Alignment Hypothesis, is that causal emergence measured early in training is consistently predictive of final reward, and that the temporal dynamics of causal emergence align with periods of reward improvement across most of the six environments. This makes causal emergence a leading indicator of learning success rather than merely a post-hoc descriptor of well-trained representations. The biological grounding matters to the argument: prior empirical work has shown that minimal biological agents demonstrably increase their causal emergence following new memory acquisition, and the paper positions the RL finding as a structural parallel, advancing Levin's broader 'diverse intelligence' program by identifying a shared representational axis between biological and artificial agents. The practical implication is pointed: if causal emergence tracks and predicts learning quality, it could serve as a target for intervention — auxiliary objectives or architectural constraints — rather than only as a diagnostic, potentially yielding more capable agents through mechanistically principled design rather than reward engineering alone. An alternative method that could have been used is representational similarity analysis (RSA) on the same latent trajectories, which would have offered a more established baseline for assessing representational reorganization, though it lacks the causal-directed framing ΦID provides. 
The most contestable element is the inferential leap from correlation to mechanism: the paper demonstrates that ΦID-measured causal emergence covaries with reward improvement across multiple algorithms and most of the six environments, but it does not perform any intervention. No experiment confirms that raising causal emergence causally produces higher reward, so the hypothesis, however well supported correlationally, remains a hypothesis in the strict sense. A critical reader would also push back on the generalizability claim, since 'six environments on a complexity spectrum' is still a narrow substrate: all six are presumably standard RL benchmarks, and it is unclear whether the alignment holds in environments with sparse reward, non-Markovian dynamics, or continuous high-dimensional action spaces. The paper's own prediction, that ΦID-targeted interventions will produce better RL agents, is explicit and falsifiable, making it a direct agenda item for follow-up empirical work.

Methods (1)

ΦID-based estimation of causal emergence in RL latent dynamics
The specific procedure: train RL agents, extract latent representations over time, and compute causal emergence using the Integrated Information Decomposition framework.
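
To make the shape of this procedure concrete, here is a rough sketch of the last step under strong simplifying assumptions. It does not implement ΦID. Instead it computes a simpler whole-minus-sum-of-parts proxy: how well a macro feature of the latent state predicts its own future, minus how well each individual latent dimension predicts that future. The Gaussian mutual-information estimate, the choice of the mean over dimensions as the macro feature, and the names `gaussian_mi` and `emergence_proxy` are all assumptions made for illustration.

```python
import numpy as np


def gaussian_mi(x, y):
    """Mutual information in nats, assuming x and y are jointly Gaussian."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)

    def logdet_cov(z):
        cov = np.atleast_2d(np.cov(z, rowvar=False))
        cov = cov + 1e-9 * np.eye(cov.shape[0])  # small ridge for stability
        return np.linalg.slogdet(cov)[1]

    return 0.5 * (logdet_cov(x) + logdet_cov(y) - logdet_cov(np.hstack([x, y])))


def emergence_proxy(latents, macro_fn=None):
    """Whole-minus-parts proxy on a (T, d) array of time-ordered latent states.

    NOT ΦID: a positive value only indicates that the macro feature predicts
    its own future better than the individual dimensions do in total.
    """
    latents = np.asarray(latents, dtype=float)
    if latents.ndim != 2 or latents.shape[0] < 3:
        raise ValueError("latents must have shape (T, d) with T >= 3")
    # Hypothetical macro feature: the mean over latent dimensions.
    macro = latents.mean(axis=1) if macro_fn is None else macro_fn(latents)
    v_next = macro[1:]
    whole = gaussian_mi(macro[:-1], v_next)
    parts = sum(gaussian_mi(latents[:-1, j], v_next)
                for j in range(latents.shape[1]))
    return whole - parts


# Synthetic latents (assumption): a slow collective mode hidden under large
# per-dimension noise that cancels in the mean, versus pure white noise.
rng = np.random.default_rng(0)
T, d = 2000, 4
mode = np.zeros(T)
for t in range(1, T):
    mode[t] = 0.9 * mode[t - 1] + rng.normal()
noise = rng.normal(scale=10.0, size=(T, d))
noise -= noise.mean(axis=1, keepdims=True)
psi_structured = emergence_proxy(mode[:, None] + noise)
psi_noise = emergence_proxy(rng.normal(size=(T, d)))
```

In a full pipeline this function would be called once per training checkpoint on the latent states collected from rollouts at that checkpoint, giving one score per checkpoint. A faithful replication would replace the proxy with an actual ΦID estimator.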

Frameworks (1)

Reinforcement Learning
Alternative framework for agent behavior; based on reward maximization rather than free energy minimization.

Findings (3)

Causal emergence predictive of final reward early in RL training across multiple algorithms, architectures, and environments.
Empirical result: causal emergence (CE) measurements correlate with and predict learning performance in RL agents.
Representational dynamics of causal emergence align with reward improvement in most tasks.
The trajectory of causal emergence through training mirrors the reward improvement curve across the majority of tested environments.
Representational dynamics aligned with reward improvement in most RL tasks.
Secondary empirical result: CE-based representational changes correlate with task success.

Claims (5)

Causal emergence may be a previously undisclosed axis of reorganization of neural representations in RL agents.
Authors' interpretive assertion that the observed alignment reveals a novel organizing principle of neural representation dynamics.
The alignment of causal emergence with learning is a shared axis along which biological and artificial creatures can be compared.
Assertion that the correlation between causal emergence and learning constitutes another way biological and artificial intelligences converge.
Biological and artificial agents share causal emergence as an axis of learning and reorganization.
Interpretive assertion bridging Levin's biological cognition work with artificial RL; extends 'minds at all scales' thesis.
Causal emergence can enable causal interventions to create better RL agents.
Assertion that understanding causal emergence may lead to methods for manipulating agent representations to improve performance.
Biological agents increase causal emergence after learning new memories.
Prior empirical observation from biological systems; motivates investigation in artificial agents.

Hypotheses (2)

Successful RL agents exhibit causal emergence that predicts final reward early in training and aligns representational dynamics with reward improvement.
Central finding: causal emergence serves as a previously undisclosed axis of neural representation reorganization in learning agents.
Causally Emergent Alignment Hypothesis
The hypothesis that successful RL agents will display causal emergence that is predictive of final reward early in training and whose representational dynamics align with reward improvement.
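
One way the "predictive of final reward early in training" half of the hypothesis could be checked is sketched below. It assumes a table of causal-emergence scores with one row per training run and one column per checkpoint, plus each run's final reward. The use of a Spearman rank correlation, the `early_fraction` cutoff of 0.2, and all function names are illustrative assumptions; the source does not say which statistic the authors used.

```python
import numpy as np


def rankdata(values):
    """1-based ranks with ties replaced by their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(rankdata(a), rankdata(b))[0, 1])


def early_prediction_score(ce_by_run, final_reward, early_fraction=0.2):
    """Rank-correlate early-training causal emergence with final reward.

    ce_by_run: array of shape (n_runs, n_checkpoints)
    final_reward: array of shape (n_runs,)
    early_fraction: share of checkpoints counted as "early" (hypothetical).
    """
    ce = np.asarray(ce_by_run, dtype=float)
    n_early = max(1, int(round(early_fraction * ce.shape[1])))
    early_ce = ce[:, :n_early].mean(axis=1)
    return spearman(early_ce, final_reward)


# Synthetic example (assumption): four runs, five checkpoints each.
ce_by_run = np.array([
    [0.10, 0.30, 0.50, 0.60, 0.60],
    [0.30, 0.50, 0.70, 0.80, 0.80],
    [0.20, 0.40, 0.60, 0.70, 0.70],
    [0.40, 0.60, 0.80, 0.90, 0.90],
])
final_reward = np.array([10.0, 30.0, 20.0, 40.0])
score = early_prediction_score(ce_by_run, final_reward)
```

A rank correlation is used here because it does not assume a linear relationship between causal emergence and reward. With only a handful of runs per condition the estimate will be noisy, so any real analysis would need enough runs (and a significance test) to support the claim.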

Questions (1)

How causally emergent are artificial agents compared to biological ones?
Core motivating question addressed by the empirical RL study; identified as major knowledge gap.

Original abstract

A hallmark of life on Earth is the ability of agents to exert causal power and be drivers of subsequent events. This is key to cognition at all scales. Causal emergence, measuring the degree to which an agent exerts unique predictive power on its future, is one consequence of causal power. Indeed, recent discoveries have shown that biological agents, even minimal ones, increase their causal emergence after learning new memories. However, there is a major knowledge gap regarding how causally emergent artificial agents are. We focused on Reinforcement Learning (RL) of neural-network agents across an array of environmental conditions, encompassing different algorithms, agent architectures, and six environments arranged on a complexity spectrum. For consistency, we computed the causal emergence of their latent-space representations over their lifetimes. We used the recently proposed ΦID to estimate causal emergence and tested how it related to learning performance. Our results suggested a Causally Emergent Alignment Hypothesis: successful agents exhibited causal emergence that was consistently predictive of final reward early in training and whose representational dynamics aligned with reward improvement in most tasks. This idea suggests that causal emergence may be a previously undisclosed axis of reorganization of neural representations in RL agents, with the potential to establish causal relationships and interventions that will lead to better RL agents. Our work also highlights the alignment between causal emergence and learning as another way biological and artificial creatures compare.
